Pig User Defined Function

Pig provides support for writing user defined functions (UDFs) to make processing easier. UDFs can be written in Java, Jython, Python, JavaScript, Ruby, and Groovy. Let’s take an example to understand how to write and use a UDF in Pig.

The UDF below converts a date from one format (“EEE MMM dd HH:mm:ss z yyyy”) to another (“yyyy-MM-dd HH:mm:ss”).

 
package com.dataunbox;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ServerTimestampToHive extends EvalFunc<String> {

    // Input pattern, e.g. "Tue Mar 25 03:23:28 UTC 2014"
    static SimpleDateFormat sdfIn = new SimpleDateFormat("EEE MMM dd HH:mm:ss z yyyy");
    // Output pattern, e.g. 1961-02-18 00:00:00
    static SimpleDateFormat sdfOut = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            String sourceDate = (String) input.get(0);
            Date ourDate = sdfIn.parse(sourceDate);
            return sdfOut.format(ourDate);
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}

For example, the above UDF will convert “Tue Mar 25 03:23:28 UTC 2014” to “2014-03-25 03:23:28” (times are rendered in the JVM’s default time zone).
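
To sanity-check the conversion before packaging it, the exec() method can be exercised locally with Pig’s TupleFactory. The sketch below is only an illustrative test harness; the class name ServerTimestampToHiveTest and the plain main() wrapper are not part of the UDF itself:

package com.dataunbox;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class ServerTimestampToHiveTest {
    public static void main(String[] args) throws Exception {
        // Build a one-field tuple, the same shape Pig passes to exec()
        Tuple input = TupleFactory.getInstance().newTuple(1);
        input.set(0, "Tue Mar 25 03:23:28 UTC 2014");

        ServerTimestampToHive udf = new ServerTimestampToHive();
        // Prints 2014-03-25 03:23:28 when the JVM default time zone is UTC
        System.out.println(udf.exec(input));
    }
}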

Compile the above class and package it into a JAR file (for example, PigUDFs.jar) so it can be registered and used as a Pig UDF.

Notice that the class extends ‘EvalFunc’, the base class of all eval functions. Similarly, custom load functions extend ‘LoadFunc’ and custom store functions extend ‘StoreFunc’.
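
For comparison, a bare-bones load function looks roughly like the sketch below. It only shows the methods that LoadFunc requires an implementation to override and is not a working loader; the class name MyLoader is a placeholder:

package com.dataunbox;

import java.io.IOException;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;

public class MyLoader extends LoadFunc {

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Point the underlying InputFormat at the input path
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Return the Hadoop InputFormat used to read the data
        return null; // placeholder
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        // Keep a reference to the reader for use in getNext()
    }

    @Override
    public Tuple getNext() throws IOException {
        // Turn the next input record into a Pig Tuple; return null at end of data
        return null; // placeholder
    }
}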

Now let’s use the Pig UDF in a script.

To use the UDF, first register the JAR that contains it, then call the class on the column to be processed.

 
REGISTER /home/hadoop/PigUDFs.jar
REGISTER /home/hadoop/elephant-bird/json_simple-1.1.jar
REGISTER /home/hadoop/elephant-bird/elephant-bird-pig-4.5.jar
REGISTER /home/hadoop/elephant-bird/elephant-bird-core-4.5.jar
REGISTER /home/hadoop/elephant-bird/guava-17.0.jar
REGISTER /home/hadoop/elephant-bird/elephant-bird-hadoop-compat-4.5.jar

data_json = load '/inputdata/pig_tmp/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader();
data = foreach data_json generate (chararray)$0#'id' as id, (chararray)$0#'userId' as userId,
    com.dataunbox.ServerTimestampToHive((chararray)$0#'acquisitionTime') as acquisitionTime,
    com.dataunbox.ServerTimestampToHive((chararray)$0#'storageTime') as storageTime;

store data into '/data/format/$date/';
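
To keep the foreach statement readable, the fully qualified class name can also be shortened with a DEFINE alias; the alias name ToHiveTs below is arbitrary:

DEFINE ToHiveTs com.dataunbox.ServerTimestampToHive();
data = foreach data_json generate ToHiveTs((chararray)$0#'acquisitionTime') as acquisitionTime;

Also note that the output path references a $date parameter, so supply a value when launching the script, for example pig -param date=2014-03-25 myscript.pig (the script file name here is just a placeholder).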
