A file compressed with the GZIP codec cannot be split: a gzip stream can only be decoded from its beginning, so a reader cannot start decompressing at an arbitrary offset in the middle of the file.
A single split in Hadoop is processed by exactly one mapper, so a single GZIP file can only be processed by a single mapper.
There are at least three ways to work around that limitation:
- As a preprocessing step: uncompress the file and recompress it using a splittable codec (LZO).
- As a preprocessing step: uncompress the file, split it into smaller files, and recompress each one. (See this)
- Use this patch for Hadoop (which I wrote), which provides a way around the limitation: Splittable Gzip
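To illustrate the second option, here is a minimal sketch in plain Python of splitting one large gzip file into several smaller ones outside Hadoop. The function name `split_gzip`, the `part-NNNNN.gz` naming, and the `lines_per_part` parameter are my own choices for this example, and it assumes the input is line-oriented text so that records are never cut in half.

```python
import gzip
from pathlib import Path


def split_gzip(src, out_dir, lines_per_part=1_000_000):
    """Stream-decompress `src` and rewrite it as several smaller gzip files.

    Hypothetical helper: splits on line boundaries, so each output file
    holds whole records and can be handed to its own mapper.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    with gzip.open(src, "rb") as fin:
        for i, line in enumerate(fin):
            # Start a new output file every `lines_per_part` lines.
            if i % lines_per_part == 0:
                if out is not None:
                    out.close()
                part = out_dir / f"part-{len(parts):05d}.gz"
                out = gzip.open(part, "wb")
                parts.append(part)
            out.write(line)
    if out is not None:
        out.close()
    return parts
```

Each resulting file is still not splittable on its own, but Hadoop creates at least one split per file, so you get roughly one mapper per part. Pick `lines_per_part` so that each part ends up near your HDFS block size.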
HTH