Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


unix - Random access to gzipped files?

I have a very large file compressed with gzip sitting on disk. The production environment is "Cloud"-based, so the storage performance is terrible, but CPU is fine. Previously, our data processing pipeline began with gzip -dc streaming the data off the disk.

Now, in order to parallelise the work, I want to run multiple pipelines that each take a pair of byte offsets - start and end - and take that chunk of the file. With a plain file this could be achieved with head and tail, but I'm not sure how to do it efficiently with a compressed file; if I gzip -dc and pipe into head, the offset pairs that are toward the end of the file will involve wastefully seeking through the whole file as it's slowly decompressed.

So my question is really about the gzip algorithm - is it theoretically possible to seek to a byte offset in the underlying file or get an arbitrary chunk of it, without the full implications of decompressing the entire file up to that point? If not, how else might I efficiently partition a file for "random" access by multiple processes while minimising the I/O throughput overhead?


1 Reply


Yes, you can access a gzip file randomly by reading the entire thing sequentially once and building an index. See examples/zran.c in the zlib distribution.
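The "read once, build an index" idea can be sketched in Python. This is not a port of zran.c: as far as I understand it, zran.c records a bit offset and a 32 KiB window per access point so the index can be saved to disk, while the sketch below swaps in in-memory snapshots of the decompressor taken with `decompressobj.copy()`. The names `build_index`, `read_at`, `span` and `piece` are made up for illustration.

```python
import bisect
import zlib

def build_index(gz, span=1 << 20, piece=16384):
    """One sequential pass over the gzip bytes.

    Returns a list of (uncompressed_offset, compressed_offset, snapshot),
    with a new entry roughly every `span` uncompressed bytes.
    """
    inflater = zlib.decompressobj(31)  # wbits=31: expect gzip header/trailer
    index = [(0, 0, inflater.copy())]
    out_pos = 0
    for in_pos in range(0, len(gz), piece):
        out_pos += len(inflater.decompress(gz[in_pos:in_pos + piece]))
        if inflater.eof:
            break
        if out_pos - index[-1][0] >= span:
            consumed = min(in_pos + piece, len(gz))
            index.append((out_pos, consumed, inflater.copy()))
    return index

def read_at(gz, index, offset, length, piece=16384):
    """Return up to `length` uncompressed bytes starting at `offset`."""
    starts = [entry[0] for entry in index]
    out_pos, in_pos, snapshot = index[bisect.bisect_right(starts, offset) - 1]
    inflater = snapshot.copy()  # keep the stored snapshot reusable
    skip = offset - out_pos     # bytes to discard after the access point
    chunks, got = [], 0
    while got < skip + length and in_pos < len(gz) and not inflater.eof:
        block = inflater.decompress(gz[in_pos:in_pos + piece])
        in_pos += piece
        chunks.append(block)
        got += len(block)
    return b"".join(chunks)[skip:skip + length]
```

Each reader only decompresses from the nearest earlier access point, so the cost of a read is bounded by `span` rather than by the offset. The snapshots live in memory only; persisting the index is what zran.c is for.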

If you are in control of creating the gzip file, then you can optimize the file for this purpose by building in random access entry points and constructing the index while compressing.
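A minimal sketch of building the index while compressing, assuming Python's `zlib` module and a full flush at each entry point. After `Z_FULL_FLUSH` the output is byte-aligned and the next block does not refer to earlier data, so a raw inflate can start there. `compress_with_index`, `read_from_entry` and `span` are hypothetical names, and holding everything in memory is only for brevity.

```python
import zlib

def compress_with_index(data, span=1 << 20, level=6):
    """Gzip `data`, adding an entry point before every `span`-byte chunk
    after the first.

    Returns (gzip_bytes, index), where index is a list of
    (uncompressed_offset, compressed_offset) pairs.
    """
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)  # wbits=31: gzip wrapper
    parts, index, written = [], [], 0
    for start in range(0, len(data), span):
        if start:
            # Full flush: byte-align the output and reset the history.
            flushed = comp.flush(zlib.Z_FULL_FLUSH)
            parts.append(flushed)
            written += len(flushed)
            index.append((start, written))
        piece = comp.compress(data[start:start + span])
        parts.append(piece)
        written += len(piece)
    parts.append(comp.flush())  # Z_FINISH: last block plus gzip trailer
    return b"".join(parts), index

def read_from_entry(gz, entry, length):
    """Decompress up to `length` (> 0) bytes starting at an index entry."""
    _, compressed_offset = entry
    inflater = zlib.decompressobj(-15)  # raw deflate: no header expected here
    return inflater.decompress(gz[compressed_offset:], length)
```

The resulting file is still an ordinary gzip stream, so `gzip -dc` reads it as before; only readers that know the index can jump straight to an entry point.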

You can also create a gzip file with markers by using Z_SYNC_FLUSH followed by Z_FULL_FLUSH in zlib's deflate(). This inserts two markers and makes the next block independent of the previous data. It will reduce the compression, but not by much if you don't do it too often; e.g. once every megabyte should have very little impact. Then you can search for the resulting nine-byte marker (which has a much lower probability of a false positive than bzip2's six-byte marker): 00 00 ff ff 00 00 00 ff ff.
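The marker approach might look like the following in Python, again only as a sketch. It assumes the writer flushes after every `span` bytes and that a reader starts a raw inflate on the byte right after a marker. `compress_with_markers`, `find_markers` and `read_after_marker` are hypothetical helpers, not zlib functions.

```python
import zlib

# Empty stored block from Z_SYNC_FLUSH, then another from Z_FULL_FLUSH.
MARKER = bytes.fromhex("0000ffff000000ffff")

def compress_with_markers(data, span=1 << 20, level=6):
    """Gzip `data`, emitting the nine-byte marker after every `span` bytes."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)  # wbits=31: gzip wrapper
    parts = []
    for start in range(0, len(data), span):
        parts.append(comp.compress(data[start:start + span]))
        parts.append(comp.flush(zlib.Z_SYNC_FLUSH))  # ends in 00 00 ff ff
        parts.append(comp.flush(zlib.Z_FULL_FLUSH))  # adds 00 00 00 ff ff
    parts.append(comp.flush())  # Z_FINISH: last block plus gzip trailer
    return b"".join(parts)

def find_markers(gz):
    """Return the byte positions of every marker in the compressed data."""
    positions, pos = [], gz.find(MARKER)
    while pos != -1:
        positions.append(pos)
        pos = gz.find(MARKER, pos + len(MARKER))
    return positions

def read_after_marker(gz, marker_pos, max_bytes):
    """Decompress up to `max_bytes` (> 0) bytes following a marker."""
    inflater = zlib.decompressobj(-15)  # raw deflate: we start mid-stream
    return inflater.decompress(gz[marker_pos + len(MARKER):], max_bytes)
```

A worker given a compressed byte range could scan forward from its start for the first marker and begin decompressing there. A marker tells you where decompression can start but not the uncompressed offset, so if workers need exact uncompressed positions, an index is still required.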

