out of memory - Python - How to gzip a large text file without MemoryError?

Question

Welcome To Ask or Share your Answers For Others

out of memory - Python - How to gzip a large text file without MemoryError?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

out of memory - Python - How to gzip a large text file without MemoryError?

I use the following simple Python script to compress a large text file (say, 10GB) on an EC2 m3.large instance. However, I always got a MemoryError:

import gzip

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        f_out.writelines(f_in)
        # or the following:
        # for line in f_in:
        #     f_out.write(line)

The traceback I got is:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    f_out.writelines(f_in)
MemoryError

I have read some discussion about this issue, but still not quite clear how to handle this. Can someone give me a more understandable answer about how to deal with this problem?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:16:35+0000

The problem here has nothing to do with gzip, and everything to do with reading line by line from a 10GB file with no newlines in it:

As an additional note, the file I used to test the Python gzip functionality is generated by fallocate -l 10G bigfile_file.

That gives you a 10GB sparse file made entirely of 0 bytes. Meaning there are no newline bytes. Meaning the first line is 10GB long. Meaning it will take 10GB to read the first line. (Or possibly even 20 or 40GB, if you're using pre-3.3 Python and trying to read it as Unicode.)

If you want to copy binary data, don't copy line by line. Whether it's a normal file, a GzipFile that's decompressing for you on the fly, a socket.makefile(), or anything else, you will have the same problem.

The solution is to copy chunk by chunk. Or just use copyfileobj, which does that for you automatically.

import gzip
import shutil

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

By default, copyfileobj uses a chunk size optimized to be often very good and never very bad. In this case, you might actually want a smaller size, or a larger one; it's hard to predict which a priori.* So, test it by using timeit with different bufsize arguments (say, powers of 4 from 1KB to 8MB) to copyfileobj. But the default 16KB will probably be good enough unless you're doing a lot of this.

_{* If the buffer size is too big, you may end up alternating long chunks of I/O and long chunks of processing. If it's too small, you may end up needing multiple reads to fill a single gzip block.}

Categories

out of memory - Python - How to gzip a large text file without MemoryError?

out of memory - Python - How to gzip a large text file without MemoryError?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags