Hadoop – Word Count

Hello Readers,

Over the last few months I have been getting my feet wet with Hadoop and MapReduce. Hadoop provides an elegant infrastructure for processing massive amounts of data using distributed storage and distributed computation. I won’t get into the why/what/how of Hadoop here; this post assumes you have some background with Hadoop and are looking to write some MR code.

The Word Count program is the “Hello World” of Hadoop.

Problem: Given a text file, write a MapReduce program that outputs the frequency of each word.

Major components:

  1. Mapper: map() is executed once for each input line. It tokenizes the line into words and outputs each word as the key and 1 as the value. Ex: If the word “Tendulkar” appears in the line twice, the mapper will output the pair (Tendulkar, 1) twice.
  2. Reducer: The output of the mapper is shuffled and partitioned before it is given as input to the reducer. reduce() receives each unique key together with an iterable of all the values emitted for that key. Ex: If the word “Tendulkar” appears in the file 6 times, the input to the reducer will be: key – Tendulkar, values – <1,1,1,1,1,1>. reduce() simply adds up the values in the iterable and outputs the total along with the word.
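To make that data flow concrete before we look at the Hadoop classes, here is a tiny plain-Java sketch that simulates the map, shuffle, and reduce phases entirely in memory. No Hadoop is required to run it, and the class and method names are purely illustrative:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSimulation
{
    // Simulates map -> shuffle -> reduce for word count, all in memory.
    public static Map<String, Long> count(String[] lines)
    {
        // Map phase: emit (word, 1) for every word in every line.
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1L));

        // Shuffle phase: group the emitted values by key, the way the
        // framework does between the map and reduce phases.
        Map<String, List<Long>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Long> e : mapped)
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce phase: sum the values for each key.
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : shuffled.entrySet())
        {
            long sum = 0;
            for (long v : e.getValue())
                sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args)
    {
        String[] lines = { "Tendulkar scored a century", "Tendulkar batted well" };
        System.out.println(count(lines));
    }
}
```

In a real cluster the three phases run on different machines and the shuffle moves data over the network; the logic per phase, however, is exactly this simple.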

Let’s take a look at the Mapper class:

public static class WordCountMapper extends Mapper<Object, Text, Text, LongWritable>
    {
        // Reused across map() calls to avoid allocating new objects per record.
        Text         outKey   = new Text();
        LongWritable outValue = new LongWritable(1);

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException
        {
            // Split on whitespace; "\\s+" also handles tabs and repeated spaces.
            String[] words = value.toString().split("\\s+");

            for (String word : words)
            {
                outKey.set(word);
                context.write(outKey, outValue); // emit (word, 1)
            }
        }
    }

Reducer class:

public static class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable>
    {
        LongWritable result = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException
        {
            // Sum the values rather than merely counting them, so the reducer
            // still works if it is also registered as a combiner (in which
            // case the incoming values may already be partial sums > 1).
            long count = 0;
            for (LongWritable val : values)
            {
                count += val.get();
            }

            result.set(count);
            context.write(key, result); // emit (word, total)
        }
    }

You need to stitch the Mapper and Reducer classes together in the main method of the Driver class. The complete example can be found at:

https://github.com/badalrocks/hadoop/blob/master/src/com/badal/mapreduce/samples/WordCount.java
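For reference, a minimal Driver might look like the sketch below. This is my own outline of the standard Job-setup boilerplate, not a copy of the linked file, so treat the linked file as authoritative; the job name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        // Wire in the Mapper and Reducer shown above.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Key/value types emitted by the reducer (and, here, the mapper too).
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // args[0] = input path, args[1] = output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Once packaged as a jar, a job like this is typically launched with something along the lines of `hadoop jar wordcount.jar WordCount <input path> <output path>`.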

Input File: received as the first argument, args[0]

https://github.com/badalrocks/hadoop/blob/master/resources/input/word_count/sachin_wiki.txt

Output File: received as the second argument, args[1]

https://github.com/badalrocks/hadoop/blob/master/resources/output/word_count/word_count-r-00000

Hope this is helpful.

Thank you!


About Badal Chowdhary

I am a Software Engineer by profession and hold SCJP and SCWCD certifications. I like working on cutting-edge technologies and frameworks, am driven by challenges, and am fascinated by technology. I love playing and watching sports: cricket, ping pong, tennis, badminton, and racquetball, and I enjoy the gym.
This entry was posted in Hadoop, Map reduce.

