Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

How to write file to binary, or edit single line in large file – Python

I have several large XML files that won't parse because of an unrecognised character. The error looks like this:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 528370, column 153

On smaller files I am also seeing this, but can open the file with a text editor and fix the issue. However, my text editor won't read the large files.

I hacked together a Python script to print the offending line, and from that there appears to be a Unicode encoding problem: the “μ” (for micro[metres]) is encoded as xb5, where I think it should be x00B5. There are several of these on the same line.
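To make the byte values concrete, here is a small check. It assumes the file is meant to be UTF-8 and that the stray byte came from a Latin-1 (ISO-8859-1) source; the helper name is_valid_utf8 is only for illustration. Under that assumption, U+00B5 is the code point, a lone xb5 is its Latin-1 encoding, and the UTF-8 encoding would be the two bytes xc2 xb5.

```python
# U+00B5 MICRO SIGN is a code point, not a byte sequence.
micro = "\u00b5"

latin1_bytes = micro.encode("latin-1")  # single byte 0xB5
utf8_bytes = micro.encode("utf-8")      # two bytes 0xC2 0xB5


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(latin1_bytes, utf8_bytes)
print(is_valid_utf8(b"10 \xb5m"), is_valid_utf8(b"10 \xc2\xb5m"))
```

If that assumption holds, a UTF-8 parser would reject the lone xb5 because it is a continuation byte with no lead byte before it.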

I found that the only way to read that line was in binary mode. Anything else wouldn't parse it (i.e. the Unicode decoder wouldn't read it).

I could not find a method to read that line, fix it, and then write back just that line.
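One possible reason there is no direct way to do this: replacing a one-byte xb5 with a two-byte UTF-8 sequence changes the line length, so every byte after it would have to move. A sketch of the usual workaround is to stream the file to a new one in binary mode and repair lines on the way through. This assumes the file should be UTF-8 and that any invalid byte is Latin-1; the names repair_line, repair_file and the "latin1_fallback" handler are made up for this example.

```python
import codecs


def _latin1_fallback(err):
    """Decode only the bytes UTF-8 rejected as Latin-1, then carry on."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return err.object[err.start:err.end].decode("latin-1"), err.end


# Register the handler under a hypothetical name chosen for this sketch.
codecs.register_error("latin1_fallback", _latin1_fallback)


def repair_line(raw: bytes) -> bytes:
    """Return raw re-encoded as valid UTF-8 (unchanged if already valid)."""
    return raw.decode("utf-8", errors="latin1_fallback").encode("utf-8")


def repair_file(src: str, dst: str) -> int:
    """Copy src to dst line by line; return the number of lines changed."""
    changed = 0
    with open(src, "rb") as fd_in, open(dst, "wb") as fd_out:
        for raw in fd_in:
            fixed = repair_line(raw)
            changed += fixed != raw
            fd_out.write(fixed)
    return changed
```

Only one line is held in memory at a time, so the file size should not matter. Valid multi-byte UTF-8 sequences on the same line are left alone, because the handler is only called for the bytes that fail to decode.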

So, in a desperate bid to get around the large file size, I thought I could perhaps just split the file up on a line-by-line basis, edit the piece containing the error, and then stitch the pieces back together. Each file in the split is 524,288 (1024 × 512) lines.

This obviously breaks the XML in the individual files, but that is not a problem if I stitch them back together in the right order. I can't parse the file into smaller XML elements, because, as above, ElementTree chokes on the encoding.
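For the stitching step, a minimal sketch could look like the following. It assumes the pieces carry a numeric suffix (for example thefile.rdf.0, thefile.rdf.1, ...); the function names part_index and join_parts are hypothetical. The pieces are sorted by the integer suffix rather than alphabetically, since a plain string sort would put .10 before .2.

```python
import shutil


def part_index(name: str) -> int:
    """Return the numeric suffix of a piece such as 'thefile.rdf.12'."""
    return int(name.rsplit(".", 1)[1])


def join_parts(parts, dst: str) -> None:
    """Concatenate the pieces into dst, in numeric suffix order."""
    with open(dst, "wb") as fd_out:
        for part in sorted(parts, key=part_index):
            with open(part, "rb") as fd_in:
                # Copies in chunks, so no piece is loaded whole into memory.
                shutil.copyfileobj(fd_in, fd_out)
```

Everything is opened in binary mode so the bytes are copied through untouched.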

So, here is my script to split the file on a line basis:

import contextlib

file_large = 'thefile.rdf'
l = 1024*512  # lines per split file
with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large, 'rb'))
    for i, line in enumerate(fd_in):
        if not i % l:
            file_split = '{}.{}'.format(file_large, i//l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        fd_out.write('{}'.format(line))

This works well enough and quickly enough, except that it writes each binary line out as its string representation, so that when you read the file in a text editor you get 500k lines on a single line and the text reads like this:

…b'<dcterms:contributor>
'b'<rdf:Description>
'b'<rdfs:label>University of Durham</rdfs:label>
'b'<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Organization" />
'b'</rdf:Description>
'b…

This seems to indicate that it reads the lines in as binary and then writes them out as strings. I tried changing the last couple of lines to:

            fd_out = stack.enter_context(open(file_split, 'w+b'))
        fd_out.write('{}'.format(bytearray(line)))

But then I get a Python error:

TypeError: a bytes-like object is required, not 'str'
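For what it's worth, the error is consistent with '{}'.format(...) always returning a str, whatever is passed in, while a file opened in binary mode only accepts bytes. A sketch of the split that writes each line straight through is below; the function name split_by_lines and its parameters are invented for this example, and it closes each piece before opening the next rather than keeping them all open.

```python
def split_by_lines(path: str, lines_per_file: int = 1024 * 512) -> list:
    """Split path into pieces named '<path>.<n>'; return the piece names."""
    parts = []
    fd_out = None
    try:
        with open(path, "rb") as fd_in:
            for i, line in enumerate(fd_in):
                if i % lines_per_file == 0:
                    if fd_out is not None:
                        fd_out.close()
                    part = "{}.{}".format(path, i // lines_per_file)
                    fd_out = open(part, "wb")
                    parts.append(part)
                # line is already bytes: write it as-is, no str formatting.
                fd_out.write(line)
    finally:
        if fd_out is not None:
            fd_out.close()
    return parts
```

Because the bytes are never decoded, the bad xb5 bytes survive the split unchanged and the newlines stay real newlines.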

I would therefore appreciate some pointers on how either to solve the binary write issue or to correct the large XML file in situ in a better way.

Thanks

question from:https://stackoverflow.com/questions/66050276/how-to-write-file-to-binary-or-edit-single-line-in-large-file-python
