This semester, I've chosen to take a course about parallel computing. One of the projects involves writing a MapReduce program on Hadoop in Java. Connecting to the school's computing resources might be difficult at times, especially when a due date is approaching. As a result, I looked for an easy way to set up a local Hadoop environment with Docker on my Windows laptop, so that I could conduct some experiments quickly.

Preparations

Docker

Docker saves us from going through complicated installation procedures for certain software (including Hadoop in this post), and it also lets us cleanly remove that software when we need to free up some disk space.

The first step is to download and install Docker Desktop for Windows (or for Mac, if you are on macOS). Docker Desktop now supports WSL 2 (Windows Subsystem for Linux 2) as its backend instead of Hyper-V. If you do not have WSL on your Windows machine, you can follow this official guide to enable it.

Once Docker Desktop is installed and running, you can verify that both Docker and Docker Compose are installed correctly by checking their versions in your terminal (PowerShell or a WSL shell).

$ docker --version
Docker version 20.10.13, build a224086
$ docker-compose --version
Docker Compose version v2.3.3

You can also check that Docker works by launching a sample container. The command below starts an nginx web server in the background and publishes it on port 80 of your machine.

$ docker run -d -p 80:80 --name myserver nginx

VSCode

I'm sure every developer has VSCode installed, so you just need to make sure the Remote Development extension is installed as well. It lets you develop inside a container, on a remote machine, or in WSL.

Let’s go Hadoop

Setup

As you can see from the steps below, setting up a Hadoop environment with Docker is rather simple.

Clone the repo big-data-europe/docker-hadoop into a directory of your choice, then set up the Hadoop cluster via docker-compose.

git clone git@github.com:big-data-europe/docker-hadoop.git
cd docker-hadoop
docker-compose up -d

Now you are all set :). After a few moments, you can check that the cluster is working properly by visiting http://localhost:9870/, which serves the NameNode web UI.

Get on the train

It's finally time to meet your new Hadoop cluster. Start VS Code, then open the Remote Development extension's view in the left panel. Select "Containers" from the dropdown at the top, then locate the container named "namenode" and connect to it by clicking its "Attach to the container" icon. You've arrived in the Hadoop world!

Hello "WordCount"

To test the Hadoop cluster, we will run the WordCount example (compiled from its Java source) and see how it works.

But not so fast: first we need to create some sample input data.

Open a terminal in VS Code, then run:

mkdir input
echo "Hello World" > input/f1.txt
echo "Hello Docker" > input/f2.txt

The input files we just created live on the local file system (more precisely, the local file system of the Docker container), so we also need to copy them to HDFS.

hadoop fs -mkdir -p input
hdfs dfs -put ./input/* input

After the preparation, you can get the official WordCount example from this link.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Save it with the filename WordCount.java.

Now, let's find out if it works.

export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wordcount.jar WordCount*.class
hadoop jar wordcount.jar WordCount input output

The job prints a lot of log output, but what we care about is the result. Print it with the cat command.

$ hdfs dfs -cat output/part-r-0000*
Docker 1
Hello 2
World 1
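
To see why the job prints exactly these three lines, here is a minimal plain-Java sketch of the same data flow that runs without a cluster. It is only an illustration, not how Hadoop executes the job: the class name LocalWordCount and the helpers mapLine and countWords are hypothetical names made up for this sketch, and a sorted in-memory map stands in for the shuffle and sort that Hadoop performs between the map and reduce phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of the WordCount data flow; no Hadoop classes are used.
// LocalWordCount, mapLine and countWords are hypothetical names for this sketch.
class LocalWordCount {

    // "Map" step: emit a (word, 1) pair for every whitespace-separated token,
    // mirroring what TokenizerMapper.map does for one line of input.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" and "reduce" steps: group the pairs by word and sum the ones,
    // like IntSumReducer.reduce. The TreeMap keeps the keys in sorted order.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : mapLine(line)) {
                counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same content as input/f1.txt and input/f2.txt created earlier.
        List<String> lines = List.of("Hello World", "Hello Docker");
        countWords(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Running it with a recent JDK (for example `java LocalWordCount.java`) prints the same three word and count pairs as the Hadoop job. Since StringTokenizer splits on whitespace only, tokens such as "Hello" and "Hello," would be counted as different words, both here and in the real job.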

Hooray! We did it!

Clean-up

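Cleaning up is simple. Run this command from the docker-hadoop directory on your host machine to stop and remove the cluster's containers; it will make your computer's life easier.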

docker-compose down

Acknowledgement

This post draws heavily on this article by José Lise. I've added some content on how to develop with VSCode.

Create a Hadoop playground with Docker Desktop on Windows in minutes