2011-10-14

Hadoop work in the new API in Groovy.


I've been doing some actual Map/Reduce work with the new API, to see how it's changed. One issue: not enough documentation. Here then is some more, and different in a very special way: the code is in Groovy.

To use these examples: get the groovy-all JAR onto your Hadoop classpath, use the groovyc compiler to compile your Groovy source (and any Java source alongside it) into your job JAR, then bring up your cluster and submit the work like anything else.
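For completeness, here's a rough sketch of the driver that would wire the two classes below together and submit the job. It isn't from the original post: the class name GroovyLineCount, the job name, and the argument handling are my own guesses at the obvious new-API setup.

package org.apache.example.groovymr

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Hypothetical driver for the mapper and reducer below; not part of the
// original post. Takes input and output paths as command-line arguments.
class GroovyLineCount {
    static void main(String[] args) {
        def job = new Job(new Configuration(), "groovy line count")
        job.setJarByClass(GroovyLineCount)
        job.setMapperClass(GroovyLineCountMapper)
        job.setReducerClass(GroovyValueCountReducer)
        job.setOutputKeyClass(Text)
        job.setOutputValueClass(IntWritable)
        FileInputFormat.addInputPath(job, new Path(args[0]))
        FileOutputFormat.setOutputPath(job, new Path(args[1]))
        // block until the job finishes, then report success or failure
        System.exit(job.waitForCompletion(true) ? 0 : 1)
    }
}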

This pair of operations, part of a test to see how well Groovy MR jobs work, just counts the lines in a source file: about as simple as you can get.

The mapper:

package org.apache.example.groovymr

import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.Mapper

class GroovyLineCountMapper extends Mapper {

    // every input line emits the same key, with a count of one
    final static def emitKey = new Text("lines")
    final static def one = new IntWritable(1)

    void map(LongWritable key,
             Text value,
             Mapper.Context context) {
        context.write(emitKey, one)
    }
}

Nice and simple; little different from the Java version except
  • Semicolons are optional.
  • The line ending rules are stricter to compensate, so a statement that spills onto the next line must end with a comma, an operator, or some other half-finished expression (see the toy example after this list).
  • You don't have to be so explicit about types (hence the def declarations) and can let the runtime sort it out. I have mixed feelings about that.
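To make that line-ending rule concrete, here's a toy example of my own (nothing Hadoop-specific): Groovy ends the statement at the newline unless the expression is still open.

def total = 1 +
            2      // the trailing + leaves the statement open, so it continues
assert total == 3

def broken = 1
             + 2   // a complete statement, then an orphaned +2: broken == 1
assert broken == 1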
There's one other quirk: the Context parameter of the map operation (which is a generic type of the parent class) has to be explicitly declared as Mapper.Context. I have no idea why, except that it won't compile otherwise. The same goes for the Reducer.Context.

Not much to see there then. What is more interesting is the reduction side of things.

package org.apache.example.groovymr

import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer

class GroovyValueCountReducer extends Reducer {

    void reduce(Text key,
                Iterable values,
                Reducer.Context context) {
        // extract the int from each IntWritable, then sum the lot
        int sum = values.collect { it.get() }.sum()
        context.write(key, new IntWritable(sum))
    }
}

See the line that builds the sum? It's taking the iterable of values for that key, applying a closure to each element (extracting the int inside), which returns a list of integers, which is then all summed up. That is: there is a per-element transform (it.get()) and a merging of all the results (sum()). Which is, when you think about it, almost a Map and a Reduce.
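If the collect/sum idiom is new to you, the same pattern works on any plain Groovy list. A toy example of my own, outside Hadoop entirely:

def words = ["turtle", "tortoise", "terrapin"]
def lengths = words.collect { it.length() }   // per-element transform: [6, 8, 8]
assert lengths.sum() == 22                    // merge of all the results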

It's turtles all the way down.

[Artwork: Banksy, 2009 Bristol exhibition]
