
Setting Up a Multi-Node Hadoop Cluster Just Got Easy!

Reading Time: 3 minutes

In this blog, we are going to embark on the journey of setting up a multi-node Hadoop cluster in a distributed environment.

So let's not waste any time and get started.
Here are the steps you need to perform.

Prerequisites:

1. Download and install Hadoop 2.7.3 on your local machine (single-node setup): http://hadoop.apache.org/releases.html
   Use Java jdk1.8.0_111.
2. Download Apache Spark from http://spark.apache.org/downloads.html
   Choose Spark release 1.6.2.

1. Mapping the nodes

First of all, we have to edit the hosts file in the /etc/ folder on all the nodes and specify the IP address of each system followed by its host name.

# vi /etc/hosts
 Enter the following lines in the /etc/hosts file:
 192.168.1.xxx hadoop-master 
 192.168.1.xxx hadoop-slave-1 
 192.168.56.xxx hadoop-slave-2
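
One simple sanity check (not part of the original steps, just a quick verification) is to confirm from every node that the host names now resolve to the right machines:

 $ ping -c 1 hadoop-master 
 $ ping -c 1 hadoop-slave-1 
 $ ping -c 1 hadoop-slave-2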

2. Passwordless login through SSH

Next, we need to set up passwordless SSH login. For this, we need to configure key-based login:
set up SSH on every node so that the nodes can communicate with one another without any prompt for a password.

# su hduser 
 $ ssh-keygen -t rsa 
 $ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoop-master 
 $ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoop-slave-1 
 $ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoop-slave-2

Note: The .ssh folder should have permission 700, authorized_keys should have permission 644, and hduser's home directory should have permission 755 on both the master and the slaves. (This is very important, as getting it wrong wasted a lot of my time 😉 )
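
As a quick sketch, those permissions can be set on each node like this (assuming hduser's home directory is /home/hduser):

 $ chmod 755 /home/hduser 
 $ chmod 700 /home/hduser/.ssh 
 $ chmod 644 /home/hduser/.ssh/authorized_keys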

3. Set up the Java environment on the master and slaves

The folder structure must be the same on both.
Extract your Java into /home/hduser/software and set the path in hduser's .bashrc as:

export JAVA_HOME=/home/hduser/software/jdk1.8.0_111
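
As a rough sketch (the tarball name below is an assumption; adjust it to the archive you actually downloaded), this step could look like:

 $ mkdir -p /home/hduser/software 
 $ tar -xzf jdk-8u111-linux-x64.tar.gz -C /home/hduser/software 
 $ echo 'export JAVA_HOME=/home/hduser/software/jdk1.8.0_111' >> ~/.bashrc 
 $ source ~/.bashrc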

4. Configuring Hadoop

  • Install your Hadoop in /usr/local.
  • Set $HADOOP_HOME in .bashrc as:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
  • Create a directory named hadoop_data in the /opt folder and a directory named dfs in $HADOOP_HOME.
  • Inside dfs, create a directory named name, and inside name, create a directory named data.
  • The permissions for name and dfs should be 777.
  • Make sure that the hadoop_data folder in /opt is owned by hduser and has 777 permissions. (A shell sketch of these directory steps appears at the end of this step.)
  • Your core-site.xml file should look like:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop_data</value>
<description>directory for hadoop data</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:54311</value>
<description> data to be put on this URI</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:54311</value>
<description>Use HDFS as file storage engine</description>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
  • Your hdfs-site.xml file should look like:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop/dfs/name</value>
<final>true</final>
</property>
</configuration>
  • Your mapred-site.xml should look like:
<configuration>
 <property> 
 <name>mapred.job.tracker</name> 
 <value>hadoop-master:9001</value> 
 </property> 
</configuration>

  • Your yarn-site.xml should look like:
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>
<property>
 <name>yarn.resourcemanager.scheduler.address</name>
 <value>hadoop-master:8030</value>
</property> 
<property>
 <name>yarn.resourcemanager.address</name>
 <value>hadoop-master:8032</value>
</property>
<property>
 <name>yarn.resourcemanager.webapp.address</name>
 <value>hadoop-master:8088</value>
</property>
<property>
 <name>yarn.resourcemanager.resource-tracker.address</name>
 <value>hadoop-master:8031</value>
</property>
<property>
 <name>yarn.resourcemanager.admin.address</name>
 <value>hadoop-master:8033</value>
</property>
</configuration>

Now set JAVA_HOME in hadoop-env.sh as:

export JAVA_HOME=/home/hduser/software/jdk1.8.0_111
  • On the master node, set the slave host names in the $HADOOP_HOME/etc/hadoop/slaves file as:
hadoop-master
hadoop-slave-1
hadoop-slave-2
Remove the localhost entry from the above file.
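
Here is the minimal shell sketch of the directory setup and permissions mentioned at the start of this step (assuming sudo rights and Hadoop extracted to /usr/local/hadoop):

 $ sudo mkdir -p /opt/hadoop_data 
 $ sudo chown -R hduser /opt/hadoop_data 
 $ chmod -R 777 /opt/hadoop_data 
 $ sudo mkdir -p /usr/local/hadoop/dfs/name/data 
 $ sudo chown -R hduser /usr/local/hadoop/dfs 
 $ chmod -R 777 /usr/local/hadoop/dfs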

Important Note: The location of Hadoop and Spark should be the same on the master and the slaves.

5. Configuring Spark

Install Spark in /home/hduser/software.
Set your $SPARK_HOME in .bashrc as:

export SPARK_HOME=/home/hduser/software/spark-1.6.2-bin-hadoop2.6

1. Add the following line to spark-env.sh:

 export SPARK_MASTER_IP=192.168.2.xxx   # IP address of the master

2. Copy your hdfs-site.xml and core-site.xml files from $HADOOP_HOME/etc/hadoop into the $SPARK_HOME/conf folder.
3. On the master node, add the IP addresses (or host names) of the slaves to the slaves file located in $SPARK_HOME/conf. (A small sketch of both steps follows.)
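
A minimal sketch of steps 2 and 3, run on the master (the host names below are the ones defined in /etc/hosts earlier):

 $ cp $HADOOP_HOME/etc/hadoop/core-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/ 
 $ echo "hadoop-slave-1" >> $SPARK_HOME/conf/slaves 
 $ echo "hadoop-slave-2" >> $SPARK_HOME/conf/slaves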

  • To run Hadoop:
    Go to $HADOOP_HOME on the master and run: hadoop namenode -format
    Then cd to $HADOOP_HOME/sbin and run start-dfs.sh followed by start-yarn.sh

Important Note:

1. start-dfs.sh will start the NameNode, SecondaryNameNode, and DataNode on the master and a DataNode on every slave node.
2. start-yarn.sh will start the ResourceManager and NodeManager on the master node and a NodeManager on the slaves.
3. Perform hadoop namenode -format only once; otherwise you will get an incompatible cluster_id exception. To resolve this error, clear the temporary data location of the DataNode, i.e. remove the files present in the $HADOOP_HOME/dfs/name/data folder.

Use the following command: rm -rf <filename>
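
A quick way to confirm that the daemons listed above actually came up (not part of the original steps, but handy) is to run jps, which ships with the JDK, on each node:

 $ jps    # on the master you should see NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager 
 $ jps    # on each slave you should see DataNode and NodeManager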

  • Start Spark: go to $SPARK_HOME/sbin and run start-all.sh.
  • Start the Thrift server and log into beeline using hduser as the username and password.
  • To start the Thrift server, use the following command inside $SPARK_HOME:

./bin/spark-submit --master <spark master URL> --conf spark.sql.hive.thriftServer.singleSession=true --class <pathOfClassToRun> <pathToYourApplicationJar> hdfs://hadoop-master:54311/<pathToStoreLocation>

Note:
You can find the Spark master URL on the master node's web UI at this address: hadoop-master:8080
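
As an illustrative sketch, a beeline session could then look like the following (port 10000 is the default Thrift/JDBC port and is an assumption here, not taken from the original post):

 $ $SPARK_HOME/bin/beeline 
 beeline> !connect jdbc:hive2://hadoop-master:10000 hduser hduser
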
If you face any issues, refer to the troubleshooting list below.
Look in the Hadoop logs folder located in $HADOOP_HOME/logs; if you find any of these issues:

1. Incompatible cluster IDs error: clear the temporary data location of the DataNode.
2. Failed to start database: if you face this problem, remove the metastore_db directory.
   org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@7962a746: remove metastore_db/dbex.lck
3. HiveSQLException: org.apache.hadoop.security.AccessControlException: Permission denied: user=anonymous, access=WRITE: log into beeline with hduser as the username and password.

Reference: http://www.bigdataplanet.info/2013/10/Hadoop-Installation-on-Local-Machine-Single-node-Cluster.html


