Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


xml - Parse SGML with Open Arbitrary Tags in Python 3

I am trying to parse a file such as: http://www.sec.gov/Archives/edgar/data/1409896/000118143112051484/0001181431-12-051484.hdr.sgml

I am using Python 3 and have been unable to find an existing library that parses an SGML file with open (unclosed) tags. SGML allows end tags to be omitted, so elements are closed implicitly. When I parse the example file with lxml, the standard library xml package, or Beautiful Soup, the implicitly closed tags end up being closed at the end of the file instead of at the end of the line.

For example:

<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>

This ends up being interpreted as:

<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>
</ZIP>
</STREET>
</FORM>
</COMPANY>

However, I need it to be interpreted as:

<COMPANY>Awesome Corp</COMPANY>  
<FORM> 24-7</FORM>
<ADDRESS>
<STREET>101 PARSNIP LN</STREET>
<ZIP>31337</ZIP>
</ADDRESS>

If there is a non-default parser that I can pass to lxml or BS4 to handle this, I have not found it.



1 Reply


If you can find an SGML DTD for the documents that you work with, one solution is to use osx, the SGML-to-XML converter from the OpenSP SGML toolkit, to turn the documents into XML.

Here is a simple example. Let's say that we have the following SGML document (company.sgml, with a root element added):

<!DOCTYPE ROOT SYSTEM "company.dtd">
<ROOT>
<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>

The DTD (company.dtd) looks like this:

<!ELEMENT ROOT       -  o (COMPANY, FORM, ADDRESS) >
<!ELEMENT COMPANY    -  o (#PCDATA) >
<!ELEMENT FORM       -  o (#PCDATA) >
<!ELEMENT ADDRESS    -  - (STREET, ZIP) >
<!ELEMENT STREET     -  o (#PCDATA) >
<!ELEMENT ZIP        -  o (#PCDATA) >

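The two characters after each element name are the tag omission flags: the first applies to the start tag and the second to the end tag. A - means that the tag is required and an o means that it can be omitted, so - o means that the start tag is required and the end tag can be omitted. ADDRESS is declared with - -, so both of its tags must be present.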
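
To make the omission flags concrete, here is a minimal Python sketch that reads simple declarations like the ones above and reports which elements may drop their end tag. It is only an illustration: the regular expression and the helper name end_tag_omissible are my own assumptions, and it handles only one-element-per-declaration DTDs with explicit flags, not full SGML DTD syntax.

```python
import re

# Matches simple declarations such as: <!ELEMENT ZIP  -  o (#PCDATA) >
# Assumption: one element name per declaration, with explicit omission flags.
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>[A-Za-z][\w.-]*)\s+(?P<start>[-oO])\s+(?P<end>[-oO])"
    r"\s+(?P<model>.*?)\s*>",
    re.DOTALL,
)


def end_tag_omissible(dtd_text):
    """Map each declared element name to True if its end tag may be omitted."""
    return {
        m.group("name"): m.group("end").lower() == "o"
        for m in ELEMENT_DECL.finditer(dtd_text)
    }


COMPANY_DTD = """\
<!ELEMENT ROOT       -  o (COMPANY, FORM, ADDRESS) >
<!ELEMENT COMPANY    -  o (#PCDATA) >
<!ELEMENT FORM       -  o (#PCDATA) >
<!ELEMENT ADDRESS    -  - (STREET, ZIP) >
<!ELEMENT STREET     -  o (#PCDATA) >
<!ELEMENT ZIP        -  o (#PCDATA) >
"""

flags = end_tag_omissible(COMPANY_DTD)
for name, omissible in flags.items():
    print(f"{name}: end tag {'optional' if omissible else 'required'}")
```

For the example DTD, only ADDRESS is reported as requiring its end tag.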

The SGML document can be parsed with osx, and the output can be formatted with xmllint, as follows:

osx company.sgml | xmllint --format -

Output from the above command:

<?xml version="1.0"?>
<ROOT>
  <COMPANY>Awesome Corp</COMPANY>
  <FORM> 24-7</FORM>
  <ADDRESS>
    <STREET>101 PARSNIP LN</STREET>
    <ZIP>31337</ZIP>
  </ADDRESS>
</ROOT>

Now we have well-formed XML that can be processed with lxml or other XML tools.
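
Since the question is about Python 3, the same pipeline could be driven from a script. The following is a rough sketch, not a tested recipe: it assumes that the osx executable is on PATH and can find company.dtd, and the helper names sgml_to_xml and flatten are made up for illustration. It uses the standard library xml.etree.ElementTree, but lxml would work the same way.

```python
import subprocess
import xml.etree.ElementTree as ET


def sgml_to_xml(path):
    """Run osx on an SGML file and return its XML output as bytes.

    Assumption: the osx executable from OpenSP is on PATH. With check=True,
    subprocess raises CalledProcessError if osx exits with a non-zero status.
    """
    result = subprocess.run(["osx", path], capture_output=True, check=True)
    return result.stdout


def flatten(xml_data):
    """Return {tag: stripped text} for every element with non-blank text."""
    root = ET.fromstring(xml_data)
    return {
        el.tag: el.text.strip()
        for el in root.iter()
        if el.text and el.text.strip()
    }


# The XML shown above, embedded so the parsing step runs without osx installed.
SAMPLE_XML = """<?xml version="1.0"?>
<ROOT>
  <COMPANY>Awesome Corp</COMPANY>
  <FORM> 24-7</FORM>
  <ADDRESS>
    <STREET>101 PARSNIP LN</STREET>
    <ZIP>31337</ZIP>
  </ADDRESS>
</ROOT>
"""

fields = flatten(SAMPLE_XML)
print(fields)
```

With osx installed, flatten(sgml_to_xml("company.sgml")) would replace the embedded sample. Note that flatten keeps only one value per tag name, which is enough for this small document but not for files with repeated elements.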

I don't know if there is a complete DTD for the document that you link to. The following PDF file contains related information about EDGAR, including a DTD that might be useful: http://www.sec.gov/info/edgar/pdsdissemspec910.pdf (I found it via this answer). But the linked SGML document contains elements (SEC-HEADER, for example) that are not mentioned in the PDF file.
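
If no usable DTD turns up, a line-based heuristic in plain Python might be enough for files laid out like the example in the question. This is a different technique from the DTD-driven osx conversion, and it rests on assumptions of mine: every tag starts its own line, any element that has an explicit end tag somewhere in the file is a container, and every other element is a leaf whose value ends at the end of the line. The function name close_leaf_tags is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

OPEN_LINE = re.compile(r"^<([A-Za-z][\w.-]*)>(.*)$")
CLOSE_TAG = re.compile(r"</([A-Za-z][\w.-]*)>")


def close_leaf_tags(sgml_text):
    """Append an end tag to every line whose element is never closed explicitly."""
    # Assumption: an element with an explicit end tag anywhere is a container.
    containers = set(CLOSE_TAG.findall(sgml_text))
    out = []
    for line in sgml_text.splitlines():
        m = OPEN_LINE.match(line.strip())
        if m is None:
            out.append(line)  # end tags, blank lines, anything unrecognised
            continue
        # Escape &, < and > in the value so the result is well-formed XML.
        name, text = m.group(1), escape(m.group(2))
        if name in containers:
            out.append(f"<{name}>{text}")
        else:
            out.append(f"<{name}>{text}</{name}>")
    return "\n".join(out)


SAMPLE = """\
<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>
"""

converted = close_leaf_tags(SAMPLE)
print(converted)

# The fragment has several top-level elements, so wrap it before parsing.
root = ET.fromstring(f"<ROOT>{converted}</ROOT>")
print(root.find("ADDRESS/ZIP").text)
```

A container that is itself never closed explicitly would be treated as a leaf, so this is no substitute for a real SGML parser.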

