Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



hadoop - MultipleTextOutputFormat alternative in new API

It appears that MultipleTextOutputFormat has not been migrated to the new API. If we need to choose the output directory and output filename on the fly, based on the key/value pair being written, what alternative does the new mapreduce API offer?


1 Reply


I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the MultipleOutputs class:

public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)

or

public <K,V> void write(String namedOutput, K key, V value,
                        String baseOutputPath)

The former write method requires the key to be the same type as the map output key (in case you are using this in the mapper) or the same type as the reduce output key (in case you are using this in the reducer). The value must also be typed in similar fashion.

The latter write method requires the key/value types to match the types you specified when you registered the named output through the static addNamedOutput function of MultipleOutputs:

public static void addNamedOutput(Job job,
                              String namedOutput,
                              Class<? extends OutputFormat> outputFormatClass,
                              Class<?> keyClass,
                              Class<?> valueClass)

So if you need different output types than the Context is using, you must use the latter write method.

The trick to getting different output directories is to pass a baseOutputPath that contains a directory separator, like this:

multipleOutputs.write("output1", key, value, "dir1/part");

In my case, this created files named "dir1/part-r-00000".
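
The name above follows the pattern baseOutputPath, then a task-type letter, then a zero-padded partition number. Here is a minimal plain-Java sketch of that pattern. It only models the naming and does not call Hadoop; the class and method names are hypothetical. The "m" letter for map-side writes is my understanding of the new API rather than something observed in this job, and an output format may append an extension (for example, for compression) that the sketch ignores.

```java
import java.util.Locale;

/** Plain-Java model of the output file names described above; not Hadoop code. */
final class OutputFileNames {

    /**
     * Builds the file name a task would produce for a given baseOutputPath.
     *
     * @param baseOutputPath value passed to MultipleOutputs.write, e.g. "dir1/part"
     * @param isMapTask      true for a map task ("m"), false for a reduce task ("r")
     * @param partition      task partition number, zero-padded to five digits
     */
    static String fileName(String baseOutputPath, boolean isMapTask, int partition) {
        String taskLetter = isMapTask ? "m" : "r";
        return String.format(Locale.ROOT, "%s-%s-%05d", baseOutputPath, taskLetter, partition);
    }

    public static void main(String[] args) {
        // Reduce task 0 writing to "dir1/part", as in the answer above.
        System.out.println(fileName("dir1/part", false, 0)); // dir1/part-r-00000
        // Assumed map-side equivalent for partition 3.
        System.out.println(fileName("dir2/part", true, 3));  // dir2/part-m-00003
    }
}
```

Each task writes its own file, so a job with several reducers produces several files per directory, one per partition.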

I was not successful in using a baseOutputPath that contains the .. directory, so all baseOutputPaths are strictly contained in the path passed to the -output parameter.
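
Because the baseOutputPath has to stay inside the job's output directory, a path derived from the key is easiest to keep valid by sanitizing the key first. The following is a minimal sketch under that assumption. The helper baseOutputPathFor, the allowed character set, and the "unknown" fallback directory are all hypothetical choices for illustration, not part of Hadoop.

```java
/** Hypothetical helper that turns a record key into a baseOutputPath. */
final class KeyedOutputPaths {

    /** Maps a key such as "fruit" to a baseOutputPath such as "fruit/part". */
    static String baseOutputPathFor(String key) {
        // Replace everything except letters, digits, '-' and '_' so the key
        // cannot introduce "/" or ".." segments into the path.
        String dir = key.replaceAll("[^A-Za-z0-9_-]", "_");
        if (dir.isEmpty()) {
            dir = "unknown"; // assumed fallback directory for empty keys
        }
        return dir + "/part";
    }

    public static void main(String[] args) {
        System.out.println(baseOutputPathFor("fruit"));  // fruit/part
        System.out.println(baseOutputPathFor("../etc")); // ___etc/part
    }
}
```

In a reducer, the result would be passed as the third argument of the first write method, for example multipleOutputs.write(key, value, KeyedOutputPaths.baseOutputPathFor(key.toString())).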

For more details on how to set up and use MultipleOutputs properly, see this example (not my code, but I found it very helpful; note that it does not use different output directories): https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java

