If you can find an SGML DTD for the documents that you work with, a solution could be to use the osx SGML to XML converter from the OpenSP SGML toolkit to turn the documents into XML.
Here is a simple example. Let's say that we have the following SGML document (company.sgml; with a root element):
<!DOCTYPE ROOT SYSTEM "company.dtd">
<ROOT>
<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>
The DTD (company.dtd) looks like this:
<!ELEMENT ROOT - o (COMPANY, FORM, ADDRESS) >
<!ELEMENT COMPANY - o (#PCDATA) >
<!ELEMENT FORM - o (#PCDATA) >
<!ELEMENT ADDRESS - - (STREET, ZIP) >
<!ELEMENT STREET - o (#PCDATA) >
<!ELEMENT ZIP - o (#PCDATA) >
The - o
bit means that the end tag can be omitted.
The SGML document can be parsed with osx, and the output can be formatted with xmllint, as follows:
osx company.sgml | xmllint --format -
Output from the above command:
<?xml version="1.0"?>
<ROOT>
<COMPANY>Awesome Corp</COMPANY>
<FORM> 24-7</FORM>
<ADDRESS>
<STREET>101 PARSNIP LN</STREET>
<ZIP>31337</ZIP>
</ADDRESS>
</ROOT>
Now we have well-formed XML that can be processed with lxml or other XML tools.
I don't know if there is a complete DTD for the document that you link to. The following PDF file contains related information about EDGAR, including a DTD that might be useful: http://www.sec.gov/info/edgar/pdsdissemspec910.pdf (I found it via this answer). But the linked SGML document contains elements (SEC-HEADER
, for example) that are not mentioned in the PDF file.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…