hadoop - MultipleTextOutputFormat alternative in new API


1 Reply

深蓝 · Answer 1 · 2021-10-23T17:56:00+0000

I'm using AWS EMR with Hadoop 1.0.3, and it is possible to write to different directories and files based on key/value pairs. Use either of the following write methods of the MultipleOutputs class:

public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)

or

public <K,V> void write(String namedOutput, K key, V value,
                        String baseOutputPath)

The former write method requires the key to be the same type as the map output key (if you call it in the mapper) or the reduce output key (if you call it in the reducer). The value must match the corresponding output value type in the same way.

The latter write method requires the key/value types to match the types you specified when you registered the named output with the static addNamedOutput function of MultipleOutputs:

public static void addNamedOutput(Job job,
                                  String namedOutput,
                                  Class<? extends OutputFormat> outputFormatClass,
                                  Class<?> keyClass,
                                  Class<?> valueClass)

So if you need different output types than the Context is using, you must use the latter write method.

The trick to getting different output directories is to pass a baseOutputPath that contains a directory separator, like this:

multipleOutputs.write("output1", key, value, "dir1/part");

In my case, this created files named "dir1/part-r-00000".
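The file name reported above appears to follow a base, task type, partition pattern. The sketch below only mimics that pattern in plain Java to show what to expect for other inputs; the class and method names are hypothetical, it is not Hadoop's own code, and the 'm' marker for map-side output is an assumption that goes beyond what the answer observed.

```java
import java.util.Locale;

// Illustrative only: mimics the naming pattern observed above; not Hadoop code.
class OutputFileNames {

    // Builds "<baseOutputPath>-<taskType>-<partition padded to 5 digits>".
    // taskType is assumed to be 'r' for reducer output and 'm' for mapper output.
    static String outputFileName(String baseOutputPath, char taskType, int partition) {
        return String.format(Locale.ROOT, "%s-%c-%05d", baseOutputPath, taskType, partition);
    }

    public static void main(String[] args) {
        // Matches the file name reported in the answer.
        System.out.println(outputFileName("dir1/part", 'r', 0)); // dir1/part-r-00000
        // A different subdirectory and a later partition.
        System.out.println(outputFileName("dir2/part", 'r', 12)); // dir2/part-r-00012
    }
}
```

Because the directory portion of baseOutputPath is kept as-is, choosing a different prefix per key or per record type is what separates the output into subdirectories.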

I could not get a baseOutputPath containing the .. directory to work, so all of my baseOutputPaths stay strictly inside the path passed to the -output parameter.
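Given that restriction, it may help to check a baseOutputPath before handing it to write. The following is a minimal sketch of such a check; the helper is hypothetical and not part of Hadoop, and rejecting absolute and empty paths is an extra, conservative assumption that the answer does not cover.

```java
// Hypothetical helper, not part of Hadoop: rejects base output paths that
// could point outside the job's output directory.
class BaseOutputPaths {

    static boolean staysInsideOutputDir(String baseOutputPath) {
        // Assumption: empty and absolute paths are treated as invalid too.
        if (baseOutputPath.isEmpty() || baseOutputPath.startsWith("/")) {
            return false;
        }
        // Any ".." segment is rejected, matching the limitation described above.
        for (String segment : baseOutputPath.split("/")) {
            if (segment.equals("..")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(staysInsideOutputDir("dir1/part"));     // true
        System.out.println(staysInsideOutputDir("../other/part")); // false
    }
}
```

A mapper or reducer could run this check on the computed path and fall back to a default such as "part" when it fails, rather than letting the write fail at runtime.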

For more details on how to set up and properly use MultipleOutputs, see this code (not mine, but I found it very helpful; it does not use different output directories): https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java
