Thursday, April 16, 2020

Hadoop WordCount tutorial

In this post we will write a word count program for Hadoop, which is the equivalent of a Hello World program in other languages. This tutorial assumes that you already have Hadoop set up and running. Begin by copying the code below and saving the file as WordCount.java
Note that the file name matters: it must match the class name exactly. Before compiling the code, set the environment variables as follows:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Compile WordCount.java and create a jar:
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class

Create two input files for the MapReduce job in a local input folder as follows:
mkdir input
echo "Hello World Bye World" > input/file01
echo "Hello Hadoop Goodbye Hadoop" > input/file02

We also need to create the input folder on HDFS
hadoop fs -mkdir -p /user/$USER/input

Now we need to copy these files into HDFS:
hadoop fs -copyFromLocal input/ /user/$USER/

Verify that the files have been copied:
hadoop fs -ls /user/$USER/input

It should list two files, similar to the following (here the user name is zaid; yours will differ):
Found 2 items
-rw-r--r--   1 zaid supergroup         22 2020-04-17 09:55 /user/zaid/input/file01
-rw-r--r--   1 zaid supergroup         28 2020-04-17 09:55 /user/zaid/input/file02
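
The sizes in the listing (22 and 28) are the byte counts of the two lines plus the newline that echo appends, so they are a quick way to confirm the copy is intact. The small Java sketch below recomputes them; the class and method names are made up for illustration and are not part of the tutorial's code.

```java
import java.nio.charset.StandardCharsets;

class InputSizeCheck {
    // Byte length of a line as written by `echo`, which appends one '\n'.
    static int echoedSize(String line) {
        return (line + "\n").getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("file01: " + echoedSize("Hello World Bye World"));
        System.out.println("file02: " + echoedSize("Hello Hadoop Goodbye Hadoop"));
    }
}
```

This prints 22 for file01 and 28 for file02, matching the size column above.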

Now let's run the application. The output directory must not already exist, otherwise the job fails at startup:
hadoop jar wc.jar WordCount /user/$USER/input /user/$USER/output

The job prints a lot of log output while it runs and should finish without errors. Once it completes, you can check the result as follows:
hadoop fs -cat /user/$USER/output/part-r-00000
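
To know what to expect in that file, here is a plain Java sketch that needs no Hadoop and mimics the job on the two sample lines. It assumes WordCount splits each line on whitespace and counts case-sensitively, as the standard Hadoop example does; the class LocalWordCount is hypothetical and only stands in for the real job.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class LocalWordCount {
    // Counts whitespace-separated tokens across all lines.
    // A TreeMap keeps the words sorted, like the keys in the reducer output.
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                // "Map" step emits (word, 1); merge plays the role of the reducer's sum.
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        // Hadoop's default text output separates key and value with a tab.
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Under those assumptions it prints Bye 1, Goodbye 1, Hadoop 2, Hello 2 and World 2, one pair per line, which is what part-r-00000 should contain for these inputs.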