Create a Hadoop playground with Docker Desktop on Windows in minutes
source link: https://dev.to/txfs19260817/create-a-hadoop-playground-with-docker-desktop-on-windows-in-minutes-10im
This semester, I've chosen to take a course about parallel computing. One of the projects involves writing a MapReduce program on Hadoop in Java. Connecting to the school's computing resources might be difficult at times, especially when a due date is approaching. As a result, I looked for an easy way to set up a local Hadoop environment with Docker on my Windows laptop, so that I could conduct some experiments quickly.
Preparations
Docker
Docker saves us from going through complicated installation procedures for certain software (including Hadoop in this post), and it also lets us remove everything cleanly when we need to free up some disk space.
The first step is to download and install Docker Desktop for Windows (or for Mac, if you are on macOS). Docker Desktop can now use WSL 2 (Windows Subsystem for Linux 2) instead of Hyper-V as its backend. If WSL is not yet enabled on your Windows machine, you can follow this official guide to enable it.
Once Docker Desktop is installed and running, confirm that both Docker and Docker Compose are available by printing their versions in a terminal (PowerShell or a WSL shell).
$ docker --version
Docker version 20.10.13, build a224086
$ docker-compose --version
Docker Compose version v2.3.3
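You can also confirm that Docker works by launching a sample container. With the port mapping below, the nginx welcome page should appear at http://localhost/. Remove the container afterwards with docker rm -f myserver.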
$ docker run -d -p 80:80 --name myserver nginx
VSCode
Most developers already have VS Code installed, so you only need to make sure the Remote Development extension pack is installed as well. It lets you develop in a container, on a remote machine, or in WSL.
Let’s go Hadoop
Setup
As you can see from the steps below, setting up a Hadoop environment with Docker is rather simple.
Clone the big-data-europe/docker-hadoop repo into a directory of your choice, then set up the Hadoop cluster with docker-compose.
git clone git@github.com:big-data-europe/docker-hadoop.git
cd docker-hadoop
docker-compose up -d
Now, you are all set :). After a few moments, you can check that the cluster is working by opening the NameNode web UI at http://localhost:9870/.
Get on the train
It's finally time to meet your new Hadoop cluster. Start VS Code, then go to the left panel and select the Remote Development plugin. Select "Containers" from the dropdown at the top, then locate the container named "namenode" and connect to it by clicking the "Attach to the container" icon. You've arrived in the Hadoop world!
Hello "WordCount"
This is how we will test the Hadoop cluster. We will run the Word Count example (from source code in Java) to see how it works.
But not so fast. We first need some sample input data.
Open a terminal in VS Code (it runs inside the namenode container you just attached to), then run:
mkdir input
echo "Hello World" > input/f1.txt
echo "Hello Docker" > input/f2.txt
The input files we have just created live on the local filesystem (more precisely, the local filesystem of the Docker container). We also need to copy them to HDFS.
hadoop fs -mkdir -p input
hdfs dfs -put ./input/* input
After the preparation, you can get the official WordCount example from this link.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Save it with the filename WordCount.java.
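If the data flow of the job is not obvious from the Hadoop classes, the following plain-Java sketch may help. It is not part of the official example and uses no Hadoop API: the class name LocalWordCount and its helper methods are made up for illustration, and a TreeMap stands in for the shuffle step that groups and sorts keys before they reach the reducer. It only mimics, in a single process, what TokenizerMapper and IntSumReducer do with the two sample files.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of the WordCount job (no Hadoop involved).
class LocalWordCount {

    // "Map" step: emit a (word, 1) pair for every whitespace-separated token,
    // like TokenizerMapper does for each input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle + reduce" step: group the pairs by word and sum their counts,
    // like IntSumReducer. The TreeMap keeps the words in sorted order.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Run both steps over a list of input lines.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // Same contents as input/f1.txt and input/f2.txt.
        List<String> lines = Arrays.asList("Hello World", "Hello Docker");
        countWords(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Running it with java LocalWordCount after compiling should print one word per line followed by its count. The real job produces the same counts, but distributes the map and reduce steps across the cluster.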
Now, let's find out if it works.
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wordcount.jar WordCount*.class
hadoop jar wordcount.jar WordCount input output
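The job prints a lot of logs, but what we care about is the output. Print it with the cat command. If you rerun the job later, delete the old output directory first (hdfs dfs -rm -r output), because Hadoop refuses to write to an output path that already exists.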
$ hdfs dfs -cat output/part-r-0000*
Docker 1
Hello 2
World 1
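One detail worth knowing before you feed the job real text: the mapper relies on StringTokenizer with its default delimiters, so it splits on whitespace only. Punctuation stays attached to a word and upper and lower case are counted separately. The sketch below shows that behaviour in plain Java, again without Hadoop. The class TokenizeDemo and the normalizedTokens helper are hypothetical and not part of the official example; the normalization rule (lower-casing and dropping everything except letters and digits) is just one possible choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical demo of how the WordCount mapper tokenizes a line.
class TokenizeDemo {

    // Split a line the same way TokenizerMapper does: on whitespace only.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    // Assumed normalization (not in the original example): lower-case each
    // token and strip every character that is not a letter or a digit.
    static List<String> normalizedTokens(String line) {
        List<String> out = new ArrayList<>();
        for (String token : tokens(line)) {
            String cleaned = token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
            if (!cleaned.isEmpty()) {
                out.add(cleaned);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Hello, World! hello";
        System.out.println(tokens(line));           // [Hello,, World!, hello]
        System.out.println(normalizedTokens(line)); // [hello, world, hello]
    }
}
```

If you want "Hello," and "hello" to count as the same word, a similar normalization could be applied inside the mapper's loop before calling word.set.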
Hooray! We did it!
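Clean-up
This is simple. Run the following command from the docker-hadoop directory on your host machine to stop and remove the cluster's containers. Add the -v flag if you also want to remove the volumes declared in the compose file.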
docker-compose down
Acknowledgement
This post draws heavily on an article by José Lise. I've added some content on how to develop with VSCode.