Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
409 views
in Technique[技术] by (71.8m points)

io - Most efficient way to modify the last line of a large text file in Python

I need to update the last line from a few more than 2GB files made up of lines of text that can not be read with readlines(). Currently, it work fine by looping through line by line. However, I am wondering if there is any compiled library can achieve this more efficiently? Thanks!

Current approach

    myfile = open("large.XML")
    for line in myfile:
        do_something()
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If this is really something line based (where a true XML parser isn't necessary the best solution), mmap can help here.

mmap the file, then call .rfind(' ') on the resulting object (possibly with adjustments to handle the file ending with a newline when you really want the non-empty line before it, not the empty "line" following it). You can then slice out the final line alone. If you need to modify the file in place, you can resize the file to shave off (or add) a number of bytes corresponding to the difference between the line you sliced and the new line, then write back the new line. Avoids reading or writing any more of the file than you need.

Example code (please comment if I made a mistake):

import mmap

# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'
', 0, len(mm) - 1) + 1

    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'
')

    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"
" if needed

Apparently on some systems (e.g. OSX) without mremap, mm.resize won't work, so to support those systems, you'd probably split the with (so the mmap closes before the file object), and use file object based seeks, writes and truncates to fix up the file. The following example includes my previously mentioned Python 3.1 and earlier specific adjustment to use contextlib.closing for completeness:

import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'
', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'
')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess

The advantages to mmap over any other approach are:

  1. No need to read any more of the file beyond the line itself (meaning 1-2 pages of the file, the rest never gets read or written)
  2. Using rfind means you can let Python do the work of finding the newline quickly at the C layer (in CPython); explicit seeks and reads of a file object could match the "only read a page or so", but you'd have to hand-implement the search for the newline

Caveat: This approach will not work (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped) if you're on a 32 bit system and the file is too large to map into memory. On most 32 bit systems, even in a newly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases, you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mapping, etc.). mmap doesn't require much physical RAM, but it needs contiguous address space; one of the huge benefits of a 64 bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle without added complexity on a 32 bit OS. Most modern computers are 64 bit at this point, but it's definitely something to keep in mind if you're targeting 32 bit systems (and on Windows, even if the OS is 64 bit, they may have installed a 32 bit version of Python by mistake, so the same problems apply). Here's yet one more example that works (assuming the last line isn't 100+ MB long) on 32 bit Python (omitting closing and imports for brevity) even for huge files:

with open("large.XML", 'r+b') as myfile:
    filesize = myfile.seek(0, 2)
    # Get an offset that only grabs the last 100 MB or so of the file aligned properly
    offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
    with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
        startofline = mm.rfind(b'
', 0, len(mm) - 1) + 1
        # If line might be > 100 MB long, probably want to check if startofline
        # follows a newline here
        line = mm[startofline:].rstrip(b'
')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...