
Apache Pig: Installation and Connecting with a Hadoop Cluster

Reading Time: 4 minutes

Apache Pig is a scripting platform for analyzing large datasets. Pig provides a high-level scripting language, Pig Latin, that works with Apache Hadoop and enables developers to express complex transformations as simple scripts. Apache Pig interacts directly with the data in the Hadoop cluster.

Apache Pig translates Pig scripts into MapReduce jobs, which execute on Hadoop YARN to access datasets stored in HDFS (Hadoop Distributed File System).
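To give a flavour of Pig Latin, here is a minimal sketch (the file path, field names, and threshold are all hypothetical) of the kind of transformation Pig compiles into MapReduce jobs behind the scenes:

-- load a hypothetical comma-separated sales file from HDFS
sales = LOAD '/data/sales.txt' USING PigStorage(',') AS (region:chararray, amount:int);
-- keep only the large transactions and total them per region
big_sales = FILTER sales BY amount > 1000;
grouped = GROUP big_sales BY region;
totals = FOREACH grouped GENERATE group, SUM(big_sales.amount);
DUMP totals;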

To use Apache Pig, we first have to create a Hadoop cluster and then run Pig on it. For creating the Hadoop cluster, you can follow this blog. While following it, make the changes below when creating hdfs-site.xml.

<!-- single-node setup: keep only one copy of each HDFS block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<!-- disable HDFS permission checks to simplify local experimentation -->
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

Now that we are done with Hadoop, it is time to install Pig. We can download the latest version of Pig (pig-0.16.0) from here. After downloading, we extract Pig from the tar file, create a new pig folder in /usr/lib/, and move the extracted directory into it.

$ tar -xvf pig-0.16.0.tar.gz
$ mkdir /usr/lib/pig/
$ mv pig-0.16.0 /usr/lib/pig/

After moving the Pig folder into /usr/lib/pig, update the .bashrc file:

export PIG_HOME="/usr/lib/pig/pig-0.16.0"
export PIG_CONF_DIR="$PIG_HOME/conf"
export PIG_CLASSPATH="$PIG_CONF_DIR"
export PATH="$PIG_HOME/bin:$PATH"

We have almost completed the process. Restart the console or reload the .bashrc file, and then check whether Pig is installed correctly:

$ source ~/.bashrc
$ pig -version


Now we have completed the installation of Hadoop and Pig.

We can start Pig with these commands:

$ pig -x local        # run against the local file system
$ pig -x mapreduce    # run against the Hadoop cluster (HDFS)
$ pig                 # same as -x mapreduce (the default)

Here I want to discuss these two modes for starting Apache Pig: local and mapreduce.

When we run Pig in local mode, it runs on your local machine against the local file system, whereas in mapreduce mode it runs on the Hadoop cluster and accesses data in HDFS.

By default, Pig starts in mapreduce mode; if we want local mode, we have to specify it explicitly.
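For example, the same (hypothetical) script resolves its file paths differently depending on the mode:

$ pig -x local myscript.pig        # paths resolve against the local file system
$ pig -x mapreduce myscript.pig    # paths resolve against HDFS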

Next, we create a folder in the Hadoop cluster where we will keep all the script and text files.

$ hdfs dfs -mkdir hdfs://localhost:54310/pig_Data
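We can confirm that the directory was created by listing HDFS (assuming the same NameNode address as above):

$ hdfs dfs -ls hdfs://localhost:54310/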


Now we put the data file into the Hadoop cluster.

$ hdfs dfs -put /home/anurag/student_data.txt hdfs://localhost:54310/pig_Data/


Now we start the Pig environment with the command below; it drops us into the Grunt shell (the grunt> prompt), where we can enter Pig Latin statements:

$ pig -x mapreduce


Now we are ready to run Pig commands on the Hadoop cluster.

1. We will LOAD the file from the Hadoop cluster:

students = LOAD 'hdfs://localhost:54310/pig_Data/student_data.txt'
  USING PigStorage(',') as ( id:int, firstname:chararray,
  lastname:chararray, phone:chararray, city:chararray );
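To check that the relation has the schema we expect before going further, we can use Pig's DESCRIBE diagnostic operator:

DESCRIBE students;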


2. We can STORE the data directly on the cluster in a new directory. The STORE command always creates a new output directory; it will not write into an existing one.

STORE students INTO 'hdfs://localhost:54310/pig_Output' USING PigStorage(',');


We can see the data from both ends:

  1. In the Hadoop cluster, we can see the data with:
    hdfs dfs -cat hdfs://localhost:54310/pig_Output/part-m-00000
  2. In the Pig environment, we can see the data with:
    DUMP students;

We can also run scripts from the Grunt shell with the run command. Here is an example of a word count: we will LOAD a file from HDFS and count the occurrences of each word in it. So let's start:

  1. We assume that we have a text file on HDFS with the name sample_data.txt.
  2. Now we put our script wordcount_script.pig on the cluster. We write the script in Pig Latin:
    -- read each line of the input file as a single chararray
    lines = LOAD 'hdfs://localhost:54310/pig_Data/sample_data.txt' AS (line:chararray);
    -- split each line into words, producing one word per record
    words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- group identical words together and count each group
    grouped = GROUP words BY word;
    wordcount = FOREACH grouped GENERATE group, COUNT(words);
    DUMP wordcount;
  3. Now use the run command to run the script from the Grunt shell:
    run wordcount_script.pig;
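Alternatively, a Pig script can be launched non-interactively straight from the shell, which is handy for automation (assuming wordcount_script.pig sits in the current local directory):

$ pig -x mapreduce wordcount_script.pig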

So far we have covered the set-up process of a Hadoop cluster, connected Apache Pig to it, and run some basic commands and a script.

I hope this helps you get started with Apache Pig on a Hadoop cluster.

Thanks  🙂



