Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

ms word - How to extract text inserted with track-changes in python-docx

I want to extract text from word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.

Running the code below, I saw that paragraphs inserted in "Track Changes" mode return an empty Paragraph.text:

import docx

# raw string so that "\t" in the Windows path is not read as a tab character
doc = docx.Document(r'C:\test track changes.docx')

for para in doc.paragraphs:
    print(para)
    print(para.text)

Is there a way to retrieve the text in revision inserts (w:ins elements)?

I'm using python-docx 0.8.6, lxml 3.4.0, python 3.4, Win7

Thanks

1 Reply

Not directly using python-docx; there's no API support yet for tracked changes/revisions.

It's a pretty tricky job, as you'll discover if you search on the element names. Searching for 'open xml w:ins' is a reasonable start; it brings up this document as the first result: https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx

If I needed to do something like that in a pinch I'd get the body element using:

body = document._body._body

and then use XPath on that to return the elements I wanted, something vaguely like this aircode:

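# w:ins normally wraps runs inside a w:p, so select paragraphs containing one
inserted_ps = body.xpath('.//w:p[w:ins]')
for p in inserted_ps:
    # Paragraph.text appears to skip runs nested in w:ins (hence the empty
    # strings in the question), so join the w:t text of those runs directly
    print(''.join(t.text or '' for t in p.xpath('./w:ins//w:t')))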

You'll be on your own for figuring out what XPath expression will get you the paragraphs you want.
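
As a starting point for that, here is a minimal sketch of two helpers: one that returns only the inserted text of a paragraph, and one that returns the text as it would read with deletions ignored. It assumes that w:ins wraps runs inside a paragraph and that deleted text is stored in w:delText rather than w:t, which is how Word usually writes tracked changes; other revision types (moves, for example) are not handled. The function names and the sample XML are made up for illustration and are not part of python-docx.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, written in Clark notation for tag matching
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_INS = "{%s}ins" % W_NS
W_T = "{%s}t" % W_NS


def inserted_text(p_elm):
    """Return the text of every w:t nested inside a w:ins under p_elm."""
    parts = []
    for ins in p_elm.iter(W_INS):
        for t in ins.iter(W_T):
            parts.append(t.text or "")
    return "".join(parts)


def accepted_text(p_elm):
    """Return all w:t text under p_elm; w:delText is a different tag, so
    deleted text is left out (assumption: no other revision types present)."""
    return "".join(t.text or "" for t in p_elm.iter(W_T))


# Hypothetical paragraph: one unchanged run, one deletion, one insertion
SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">Kept </w:t></w:r>'
    '<w:del w:id="1" w:author="a"><w:r><w:delText>old</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="a"><w:r><w:t>new</w:t></w:r></w:ins>'
    '</w:p>' % W_NS
)

p = ET.fromstring(SAMPLE)
print(inserted_text(p))  # new
print(accepted_text(p))  # Kept new
```

Because the helpers only call .iter() with namespace-qualified tags, they should also accept the lxml element behind a python-docx paragraph, for example inserted_text(para._p) for each para in doc.paragraphs. Note that _p is a private attribute and may change between python-docx versions.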

opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html
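
If installing opc-diag isn't convenient, a rough standard-library alternative is to pull the main document part out of the .docx (which is a zip archive) and pretty-print it, then search the output for w:ins. This is only a sketch: the function name is hypothetical, and it assumes the main part lives at word/document.xml, which is the usual location but not guaranteed.

```python
import zipfile
import xml.dom.minidom


def dump_document_xml(docx_path):
    """Return the pretty-printed XML of word/document.xml from a .docx file.

    docx_path may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as pkg:
        # Assumption: the main document part is at the conventional location
        raw = pkg.read("word/document.xml")
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")
```

Calling print(dump_document_xml(r'C:\test track changes.docx')) and searching the output for w:ins should show where the inserted runs sit relative to their paragraphs.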



...