Sample Map/Reduce Project with Hadoop 1.2.1

This tutorial walks through a Map/Reduce example that uses Hadoop to count words. Here I’ll show you how to quickly run simple jobs on a Hadoop node.

 

Prerequisites

I assume you already have a Hadoop node up and running. If you don’t have a Single Node configured yet, check my post Setup a Hadoop Single Node – Version 1.2.1.

All the steps below work with the setup from my “Setup a Hadoop Single Node” post.

 

01. Access your Hadoop Master Node via SSH

Just SSH to localhost or to the node’s IP address.

$ ssh localhost

 

02. Format a new distributed filesystem

This command will wipe all data on your Hadoop cluster, so only run it on a fresh node or one whose data you don’t mind erasing.

$ hadoop namenode -format

 

03. Create a text file with some text

In this example I’ll create a text file from an API that generates sample content. The file will be named “source.txt”.

$ wget -O source.txt https://baconipsum.com/api/?type=all-meat\&paras=700\&start-with-lorem=1\&format=html

 

04. Send file to HDFS (Hadoop Distributed File System)

Create an input directory:

$ hadoop fs -mkdir input

Send “source.txt” to the “input” folder on HDFS:

$ hadoop fs -put source.txt input/source.txt

You can list the files with the “ls” argument, like this:

$ hadoop fs -ls
$ hadoop fs -ls input/

 

05. Execute word count from Hadoop

This application is an example created by the Hadoop developers and can be found at “/opt/hadoop-1.2.1”. In this tutorial we’ll use this example application.

It counts the words that match a regular expression you provide.

$ hadoop jar /opt/hadoop-1.2.1/hadoop-examples-*.jar grep input output 'ba[A-Za-z]+'

Explaining the command above:

1. hadoop                    // Invoke the hadoop executable
2. jar                       // Arg to run a "jar" application
3. hadoop-examples-*.jar     // Java application with the bundled examples
4. grep                      // Example program to run (match a regex and count)
5. input                     // Directory with the source text files
6. output                    // Directory where results are written
7. 'ba[A-Za-z]+'             // Regular expression to match

So the expression above counts how many words start with “ba”. (I use “[A-Za-z]” rather than “[A-z]” because the latter also matches a few punctuation characters that sit between “Z” and “a” in the ASCII table.)
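If you’re curious about what this “grep” example does under the hood, here is a minimal sketch of a similar job written against the old “mapred” API that ships with Hadoop 1.2.1. This is not the source code of the bundled example (the real one also runs a second job that sorts the results by count, which is why the output in step 06 lists the count before the word); the class name “BaGrep” and its structure are just my own illustration:

import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class BaGrep {

  // Mapper: emit (matched word, 1) for every regex match in the input line
  public static class MatchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private final Pattern pattern = Pattern.compile("ba[A-Za-z]+");
    private final LongWritable one = new LongWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      Matcher m = pattern.matcher(value.toString());
      while (m.find()) {
        output.collect(new Text(m.group()), one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each matched word
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(BaGrep.class);
    conf.setJobName("ba-grep");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setMapperClass(MatchMapper.class);
    conf.setCombinerClass(SumReducer.class);
    conf.setReducerClass(SumReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. "input"
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. "output"
    JobClient.runJob(conf);
  }
}

If you want to try something like this, you could compile it against hadoop-core-1.2.1.jar, package it into a jar and run it with the same “hadoop jar … input output” style of command used above.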

 

06. How to see the results

The results can be found in the “output” folder on HDFS. You can also view the files from your browser.

06.1 From the command line

To see the results from the terminal, just run:

$ hadoop fs -cat output/part*

If you want to download the file, run:

$ hadoop fs -get output/part*

06.2 From the browser

Or you can do it the simple way. Just go to: http://localhost:50070

Then click “Browse the filesystem” and navigate to user > {node_name} > output > part-00000

And you will see something like this:

157 ball
92  back
88  basa
77  bacon

 

07. How to rerun this test?

To rerun this example you will need to remove the “output” folder from HDFS first, because Hadoop refuses to write to an output directory that already exists. The folder will be created again when you rerun the job.

To remove the output folder, run:

$ hadoop fs -rmr output

If you need to remove a single file, just run the command without the recursive flag. Example:

$ hadoop fs -rm file_to_be_removed

 

Now you can rerun the command:

$ hadoop jar /opt/hadoop-1.2.1/hadoop-examples-*.jar grep input output 'ba[A-Za-z]+'

 

Conclusion

In this tutorial you learned how to list, add and remove files on HDFS (Hadoop Distributed File System) and how to run a Java application on Hadoop.

With all this in place, the next step is to learn how to create other applications to run on your new cluster. I recommend starting from the example files in the Hadoop directory (“/opt/hadoop-1.2.1/”). Try to read and modify the examples. Another option is to look on YouTube for more tutorials.

When you’ve learned how to develop your own application and want to create a real cluster with multiple machines, take a look at the “Cluster Setup” section of the Hadoop documentation.

If you have any questions, just leave a comment below.

 

Credits and References

Hadoop Documentation. Stable 1 version:
https://hadoop.apache.org/docs/stable1/single_node_setup.html

Hadoop Single Node Setup by “Veera Sekhar”:
https://www.youtube.com/user/VeeraSekharPonakala/

Bacon Ipsum. Text generator. Used to create “source.txt” example file:
https://baconipsum.com/