
mapreduce - hadoop converting \r\n to \n and breaking ARC format

I am trying to parse data from commoncrawl.org using Hadoop streaming. I set up a local Hadoop instance to test my code, and have a simple Ruby mapper which uses a streaming ARCfile reader. When I invoke my code by hand, like

cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb

It works as expected.

It seems that Hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to the mapper. However, while doing so it converts the \r\n linebreaks in the stream to \n. Since ARC relies on a record length in the header line, the change breaks the parser (because the data length has changed).
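
For illustration, here is a minimal sketch of the kind of length-based read an ARC parser does (assuming the ARC v1 layout in which the last whitespace-separated field of each header line is the record length in bytes; this is not the actual ARCfile reader the mapper uses):

    #!/usr/bin/env ruby
    # Minimal sketch of a length-based ARC record reader (illustration only).
    # Each header line ends with the payload length in bytes, and the payload
    # is consumed with a byte-exact read. If an upstream layer rewrites \r\n
    # as \n, the payload is shorter than the declared length, so the read
    # drifts past the record boundary and every later header parse fails.

    $stdin.binmode

    def each_arc_record(io)
      while (header = io.gets)
        header = header.chomp
        next if header.empty?            # skip the blank separator between records
        length  = header.split.last.to_i # declared payload length in bytes
        payload = io.read(length)        # byte-exact read; assumes an untouched stream
        yield header, payload
      end
    end

    # Local check against an untouched stream:
    #   zcat 1262876244253_18.arc.gz | ruby arc_reader_sketch.rb
    each_arc_record($stdin) do |header, payload|
      $stderr.puts "#{header.split.first}\t#{payload ? payload.bytesize : 0} bytes"
    end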

To double-check, I changed my mapper to expect uncompressed data, and did:

cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb

And it works.

I don't mind Hadoop automatically decompressing (although I can quite happily deal with streaming .gz files myself), but if it does decompress, I need it to do so in 'binary', without any linebreak conversion or similar. I believe the default behaviour is to feed each decompressed file to its own mapper, which is perfect.

How can I either ask it not to decompress .gz (renaming the files is not an option) or make it decompress properly? I would prefer not to use a special InputFormat class which I have to ship in a jar, if at all possible.

All of this will eventually run on AWS ElasticMapReduce.
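
One workaround, sketched below, is to make the job input a plain-text list of .arc.gz paths (one per line) rather than the archives themselves: Hadoop's text handling then only ever touches the short path lines, and the mapper fetches and decompresses the raw bytes itself. This is a hedged sketch under stated assumptions; the path list, the hadoop fs -cat fetch, and the emitted key/value format are illustrative choices, not part of the original setup.

    #!/usr/bin/env ruby
    # Sketch: the streaming job's input is a list of .arc.gz paths, so
    # TextInputFormat only splits those path lines. Each mapper then streams
    # and gunzips its files itself, keeping any line-ending conversion away
    # from the record data.

    require 'open3'

    STDIN.each_line do |line|
      path = line.strip
      next if path.empty?

      # "hadoop fs -cat" streams the raw gzip bytes (hdfs:// paths, and s3://
      # on EMR); zcat copes with the concatenated gzip members ARC files use.
      Open3.pipeline_r(['hadoop', 'fs', '-cat', path], ['zcat']) do |out, _threads|
        out.binmode
        # Hand `out` to the streaming ARC reader here; as a stand-in, count bytes.
        bytes = 0
        while (chunk = out.read(64 * 1024))
          bytes += chunk.bytesize
        end
        puts "#{path}\t#{bytes}"
      end
    end

If the one-mapper-per-file behaviour matters, something like NLineInputFormat (one input line per split) is one way to keep it, at the cost of data locality.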


1 Reply


Looks like Hadoop's PipeMapper.java is to blame (at least in 0.20.2):

Around line 106, the input from TextInputFormat is passed to this mapper (by which stage the \r\n has been stripped), and the PipeMapper writes it out to stdout with just a \n.

A suggestion would be to check whether this 'feature' still exists in your version, and if so amend the source of PipeMapper.java as required (maybe allowing the line terminator to be set via a configuration property).
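
Before patching anything, it may be worth confirming the behaviour on the cluster in question. A throwaway diagnostic mapper (a hedged sketch, not part of the original answer) is enough:

    #!/usr/bin/env ruby
    # count_cr.rb - counts carriage-return bytes on stdin.
    # Run it once locally (zcat 1262876244253_18.arc.gz | ruby count_cr.rb)
    # and once as the mapper of a streaming job over the same .arc.gz file.
    # A non-zero local count with a zero count from the job indicates the
    # \r bytes are being stripped inside the streaming layer.

    $stdin.binmode
    cr = 0
    while (chunk = $stdin.read(64 * 1024))
      cr += chunk.count("\r")
    end
    puts "carriage_returns\t#{cr}"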

