Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - 'ParseError: not well-formed' Error for some Special characters in XML Data

I have the following code to clean up a log file and extract the XML from it (the log file is not well formatted and has no root element), and then parse the result and run other functions on it. The cleanup works, but the XML parser throws an error for some XML data that contains special characters. My code is below:

import xml.etree.ElementTree as ET

# Keep only the lines that look like XML and wrap them in a single root element.
with open(log_file, 'r') as fr, open('XMLinLog2.xml', 'w') as fw:
    fw.write("<document>\n")

    for line in fr:
        if line.strip().startswith('<'):
            fw.write('\t' + line)
    fw.write("\n</document>")

# --- Parsing Log files after cleanup ---

doc = ET.parse('XMLinLog2.xml')

The XML data in the log file that triggers the error is (1) Ops Désactivée 23:59 and (2) [ mono @ 90° >> +1. After cleanup, these show up in the output file as Ops D?sactiv?e 23:59 and [ mono @ 90? >> +1 respectively, so it looks like the ? character is causing the problem. Questions:

  1. How do I deal with this error?
  2. If I need to print that data, how can I print it correctly? I don't want to print ?, because I assume it will throw an error whenever French text containing é comes in.

Full error here:

Traceback (most recent call last):
  File "C:/Users/PycharmProjects/IMSS_TestHarness/Libraries/try.py", line 23, in <module>
    doc = ET.parse('XMLinLog2.xml')
  File "C:\Users\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "C:\Users\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 595, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3299, column 22

Process finished with exit code 1
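
The ParseError message reports a line and column, and the same values are available on the exception as its position attribute. As a rough diagnostic sketch (the helper name find_bad_line is made up for illustration and is not part of the original code), the raw bytes of the offending line can be pulled out like this, which makes it easier to see whether the file contains bytes that are not valid UTF-8:

```python
import xml.etree.ElementTree as ET


def find_bad_line(xml_bytes):
    """Return (line_number, raw_line_bytes) where parsing fails, or None if it parses.

    Hypothetical helper for diagnosis only.
    """
    try:
        ET.fromstring(xml_bytes)
        return None
    except ET.ParseError as err:
        # err.position is a (line, column) tuple; the line number is 1-based.
        line_no, _column = err.position
        return line_no, xml_bytes.split(b"\n")[line_no - 1]


if __name__ == "__main__":
    # Sample data encoded as cp1252, so "é" becomes the single byte 0xE9.
    sample = "<document>\n<Line>Ops Désactivée 23:59</Line>\n</document>".encode("cp1252")
    print(find_bad_line(sample))
```

Printing the line as bytes (rather than decoded text) shows the actual byte values, so a lone 0xE9 or 0xB0 stands out instead of being displayed as ?.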

Log file:

1.  2020-08-03 15:59:54.635 (72 ,Effective Commit) Info          Sending:
<U_DisplayCommand>
  <DestinationId>5035</DestinationId>
  <DisplayId>1</DisplayId>
  <LineTextEnglish>
    <Line>Ops Disabled 23:59 N</Line>
  </LineTextEnglish>
  <LineTextFrench>
    <Line>Ops Désactivée 23:59</Line>
  </LineTextFrench>
</U_DisplayCommand>
<U_DisplayCommand>
  <DestinationId>5085</DestinationId>
  <DisplayId>1</DisplayId>
  <LineTextEnglish>
    <Line>Vaudreuil-Dori P123A</Line>
    <Line>[ mono @ 90° &gt;&gt; +1</Line>
  </LineTextEnglish>
  <LineTextFrench>
    <Line>Vaudreuil-Dori P123A</Line>
    <Line>[ mono @ 90° &gt;&gt; +1</Line>
  </LineTextFrench>
</U_DisplayCommand>

Thanks in advance.


1 Reply


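Adding an explicit encoding when opening the output file worked for me. Without it, Python writes the file in the platform's default encoding (on Windows this is usually a legacy code page rather than UTF-8), while the XML parser assumes UTF-8 when the file has no XML declaration. Characters such as é and ° then most likely end up as bytes that are not valid UTF-8, which the parser rejects as an invalid token.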

import xml.etree.ElementTree as ET

# Write the cleaned-up file as UTF-8 so it matches what the XML parser expects.
with open(log_file, 'r') as fr, open('XMLinLog2.xml', 'w', encoding='utf-8') as fw:
    fw.write("<document>\n")

    for line in fr:
        if line.strip().startswith('<'):
            fw.write('\t' + line)
    fw.write("\n</document>")

# --- Parsing Log files after cleanup ---

doc = ET.parse('XMLinLog2.xml')
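
To see the effect in isolation, here is a small self-contained sketch. It assumes the default encoding on the asker's machine behaves like cp1252, which is a guess; the function name and the sample text are made up for illustration. It writes the same XML once as cp1252 and once as UTF-8 and tries to parse each file:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Sample data modelled on the two problem lines from the question.
XML_TEXT = (
    "<document>\n"
    "\t<Line>Ops Désactivée 23:59</Line>\n"
    "\t<Line>[ mono @ 90° &gt;&gt; +1</Line>\n"
    "</document>"
)


def line_texts_when_written_as(encoding):
    """Write XML_TEXT using `encoding`, parse the file, and return the <Line> texts.

    Returns None if the parser raises ParseError. Hypothetical helper.
    """
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as fw:
            fw.write(XML_TEXT)
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            return None
        return [elem.text for elem in root.iter("Line")]
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(line_texts_when_written_as("cp1252"))
    print(line_texts_when_written_as("utf-8"))
```

When the file is written as UTF-8, the parsed element text is an ordinary Python string that still contains é and °, so printing it should show the real characters as long as the console can display them.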
