Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
530 views
in Technique[技术] by (71.8m points)

hadoop - manipulating iterator in mapreduce

I am trying to find the sum of any given points using hadoop, The issue I am having is on getting all values from a given key in a single reducer. It looks like this.

Reducer:

 public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, DoubleWritable> output, Reporter reporter)
            throws IOException {
        Text word = new Text();

        Iterator<IntWritable> tr = values;
        IntWritable v;
        while (tr.hasNext()) {
             v = tr.next();

            Iterator<IntWritable> td = values;
            while (td.hasNext()) {

                IntWritable u = td.next();
                double sum = u+v;
                word.set( u + " + " + v);
                output.collect(word, new DoubleWritable(sum));
            }
        }
    }
}

And I am trying to create two copies of the Iterator variable so that I can go through all the values of the second iterator while I get a single value from the previous Iterator( Two while loops above) but the two iterators hold the same value all the time.

I am not sure if this is the right way to do it.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The iterators in the reducer are not as simple as you might think.

The issue is that the total number of items that you are iterating through might not fit into memory. That means that the iterator may be reading from disk. If you have two independent copies of the iterator, then you can have one of them far ahead of the other which implies that the data between where the two iterators point can't be dropped.

For simplicity of implementation, Hadoop doesn't support having more than one iterator for the reduce values.

The practical impact of this is that you can't go through the same iterator twice. That isn't nice, but it is the case. If you absolutely know that the number of items will fit into memory, then you can copy all the items into a list as suggested by MrGomez. If you don't know that, you may have to use secondary storage.

The better approach is to redesign your program so that you don't need unbounded storage in the reducer. This can get a bit tricky, but there are standard approaches to the problem.

For your particular problem, you have a quadratic growth in output size relative to the largest reduce input set. This is usually a really bad idea. In most cases you don't need ALL pairs, just the most important pairs. If you can trim the set of pairs in some way, then you are all set and you may be able to remove the all pairs constraint.

For instance, if you are trying to find the 100 pairs with the largest sum for each reduce set, you can keep a priority queue with the 100 largest inputs seen so far and a priority queue with the 100 largest sums seen so far. For each new input, you can form the sum with the largest 100 numbers seen so far and try to stick those sums into the second queue. Finally, you should stick the new input into the first queue and trim both queues to 100 elements by deleting the smallest values (if necessary). In the close method of the reduce, you should dump the priority queue. This approach guarantees that you only need min(n^2, 200) elements of storage which avoids the n^2 problem and avoids the double pass through the input by keeping the 100 largest items seen rather than all items seen.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...