
Apache Pig: Installation and Connecting with a Hadoop Cluster

Reading Time: 4 minutes

Apache Pig is a scripting platform for analyzing large datasets. Pig provides a high-level scripting language, Pig Latin, that works with Apache Hadoop and enables developers to express complex transformations as simple scripts. Apache Pig interacts directly with the data in the Hadoop cluster.

Apache Pig translates Pig scripts into MapReduce jobs, which execute on Hadoop YARN to access datasets stored in HDFS (Hadoop Distributed File System).
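To give a flavour of Pig Latin, here is a minimal sketch (the file path, field names, and threshold are all hypothetical) of the kind of transformation Pig compiles into MapReduce jobs behind the scenes:

-- load a hypothetical comma-separated sales file from HDFS
sales = LOAD '/data/sales.txt' USING PigStorage(',') AS (region:chararray, amount:int);
-- keep only the large transactions and total them per region
big_sales = FILTER sales BY amount > 1000;
grouped = GROUP big_sales BY region;
totals = FOREACH grouped GENERATE group, SUM(big_sales.amount);
DUMP totals;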

To use Apache Pig, we first have to create a Hadoop cluster and then run Pig on it. For creating the Hadoop cluster, you can follow this blog. While following it, make the changes below when creating hdfs-site.xml.

<!-- single-node setup: keep only one copy of each HDFS block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<!-- disable HDFS permission checks to simplify local experimentation -->
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

Now that we are done with Hadoop, it is time to install Pig. We can download the latest version of Pig (pig-0.16.0) from here. After downloading, we extract Pig from the tar file, create a new pig folder in /usr/lib/, and move the extracted directory into it.

$ tar -xvf pig-0.16.0.tar.gz
$ mkdir /usr/lib/pig/
$ mv pig-0.16.0 /usr/lib/pig/

After moving the Pig folder into /usr/lib/pig, update the .bashrc file:

export PIG_HOME="/usr/lib/pig/pig-0.16.0"
export PIG_CONF_DIR="$PIG_HOME/conf"
export PIG_CLASSPATH="$PIG_CONF_DIR"
export PATH="$PIG_HOME/bin:$PATH"

We have almost completed the process. Restart the console or reload the .bashrc file, and then check whether Pig is installed correctly:

$ source ~/.bashrc
$ pig -version


Now we have completed the installation of Hadoop and Pig.

We can start Pig with these commands:

$ pig -x local        # run against the local file system
$ pig -x mapreduce    # run against the Hadoop cluster (HDFS)
$ pig                 # same as -x mapreduce (the default)

Here I want to discuss these two modes for starting Apache Pig: local and mapreduce.

When we run Pig in local mode, it runs on your local machine against the local file system, whereas in mapreduce mode it runs on the Hadoop cluster and accesses data in HDFS.

By default, Pig starts in mapreduce mode; if we want local mode, we have to specify it explicitly.
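For example, the same (hypothetical) script resolves its file paths differently depending on the mode:

$ pig -x local myscript.pig        # paths resolve against the local file system
$ pig -x mapreduce myscript.pig    # paths resolve against HDFS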

Next, we create a folder in the Hadoop cluster where we will keep all the script and text files.

$ hdfs dfs -mkdir hdfs://localhost:54310/pig_Data
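We can confirm that the directory was created by listing HDFS (assuming the same NameNode address as above):

$ hdfs dfs -ls hdfs://localhost:54310/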


Now we put the data file into the Hadoop cluster.

$ hdfs dfs -put /home/anurag/student_data.txt hdfs://localhost:54310/pig_Data/


Now we start the Pig environment with the command below; it drops us into the Grunt shell (the grunt> prompt), where we can enter Pig Latin statements:

$ pig -x mapreduce


Now we are ready to run Pig commands on the Hadoop cluster.

1. We will LOAD the file from the Hadoop cluster:

students = LOAD 'hdfs://localhost:54310/pig_Data/student_data.txt'
  USING PigStorage(',') as ( id:int, firstname:chararray,
  lastname:chararray, phone:chararray, city:chararray );
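To check that the relation has the schema we expect before going further, we can use Pig's DESCRIBE diagnostic operator:

DESCRIBE students;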


2. We can STORE the data directly on the cluster in a new directory. The STORE command always creates a new output directory; it will not write into an existing one.

STORE students INTO 'hdfs://localhost:54310/pig_Output' USING PigStorage(',');


We can see the data from both ends:

  1. In the Hadoop cluster, we can see the data with:
    hdfs dfs -cat hdfs://localhost:54310/pig_Output/part-m-00000
  2. In the Pig environment, we can see the data with:
    DUMP students;

We can also run scripts from the Grunt shell with the run command. Here is an example of a word count: we will LOAD a file from HDFS and count the occurrences of each word in it. So let's start:

  1. We assume that we have a text file on HDFS with the name sample_data.txt.
  2. Now we put our script wordcount_script.pig on the cluster. We write the script in Pig Latin:
    -- read each line of the input file as a single chararray
    lines = LOAD 'hdfs://localhost:54310/pig_Data/sample_data.txt' AS (line:chararray);
    -- split each line into words, producing one word per record
    words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- group identical words together and count each group
    grouped = GROUP words BY word;
    wordcount = FOREACH grouped GENERATE group, COUNT(words);
    DUMP wordcount;
  3. Now use the run command to run the script from the Grunt shell:
    run wordcount_script.pig;
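Alternatively, a Pig script can be launched non-interactively straight from the shell, which is handy for automation (assuming wordcount_script.pig sits in the current local directory):

$ pig -x mapreduce wordcount_script.pig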

So far we have covered the set-up process of a Hadoop cluster, connected Apache Pig to it, and run some basic commands and a script.

I hope this helps you get started with Apache Pig on a Hadoop cluster.

Thanks  🙂



