Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


node.js - Adding missing XML closing tags in JavaScript

I need to parse external files with the below structure using Node.js.

<ISSUER>
<COMPANY-DATA>
<CONFORMED-NAME>EXACTECH INC
<CIK>000012345
<ASSIGNED-SIC>9999
<IRS-NUMBER>8979898988
<STATE-OF-INCORPORATION>FL
<FISCAL-YEAR-END>1231
</COMPANY-DATA>
<BUSINESS-ADDRESS>
<STREET1>22W 56TH COURT
<CITY>GAINSVILLE
<STATE>FL
<ZIP>32653
<PHONE>999-999-9999
</BUSINESS-ADDRESS>
<MAIL-ADDRESS>
<STREET1>22W 56TH COURT
<CITY>GAINSVILLE
<STATE>FL
<ZIP>32653
</MAIL-ADDRESS>
</ISSUER>

The blocks have closing tags but individual lines do not. How can I add the missing closing tags so that I can parse the XML?

I do not have control over how the XML files are generated, so I cannot get this fixed at the source.

This is similar to the following Java question: "Parsing XML with no closing tags in Java".


1 Reply


Your data looks like SGML, the markup language of which XML is a subset; unlike XML, SGML allows tags to be omitted and inferred. I'm in the process of releasing an SGML parser for JavaScript (for the browser, Node.js and other CommonJS platforms), but it's not released yet. For the time being, I suggest using the venerable OpenSP software. It doesn't have an npm integration package, but you can easily install it on e.g. Ubuntu/Debian using sudo apt-get install opensp, and similarly on other Linux distributions and on Mac OS via MacPorts.

The OpenSP package contains the osx command-line utility, which down-converts SGML to XML. You can use Node's core child_process module to invoke the osx program, pipe your SGML data to it, capture the XML it produces, and then feed that XML to the XML parser of your choice in your Node app.
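A minimal sketch of that invocation, under a few assumptions: the osx binary is on the PATH, it reads the document from standard input when no file argument is given, and the input is small enough to handle synchronously. The names runFilter and sgmlToXml are hypothetical helpers, not part of OpenSP or Node.

```javascript
const { spawnSync } = require('node:child_process');

// Runs a command-line filter: writes `input` to its stdin and returns
// its exit status together with everything it wrote to stdout and stderr.
function runFilter(command, args, input) {
  const result = spawnSync(command, args, {
    input,                        // piped to the child's stdin
    encoding: 'utf8',             // return strings instead of Buffers
    maxBuffer: 64 * 1024 * 1024,  // assumed upper bound on output size
  });
  // `error` is set when the process could not be started at all,
  // for example ENOENT when the program is not installed.
  if (result.error) throw result.error;
  return { status: result.status, stdout: result.stdout, stderr: result.stderr };
}

// Hypothetical wrapper: assumes `osx` is installed and on the PATH.
// `sgmlWithDoctype` is the DOCTYPE followed by the SGML content.
function sgmlToXml(sgmlWithDoctype) {
  return runFilter('osx', [], sgmlWithDoctype);
}
```

The caller would inspect stderr and status for parser diagnostics before passing stdout to an XML parser. In a server that must not block, the asynchronous spawn from the same module would be the more natural choice.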

SGML and the osx program must be told to add the omitted end-element tags for CONFORMED-NAME, CIK, and the other elements whose end tags are missing. You do that by prepending a document type declaration (DOCTYPE), which carries the element declarations of the DTD, to your SGML content. In your case, what you supply to the osx program should look as follows:

<!DOCTYPE ISSUER [
  <!ELEMENT ISSUER - -
     (COMPANY-DATA,BUSINESS-ADDRESS,MAIL-ADDRESS)>
  <!ELEMENT COMPANY-DATA - -
     (CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
       STATE-OF-INCORPORATION,FISCAL-YEAR-END)>
  <!ELEMENT (BUSINESS-ADDRESS,MAIL-ADDRESS) - -
     (STREET1,CITY,STATE,ZIP,PHONE?)>
  <!ELEMENT
     (CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
       STATE-OF-INCORPORATION,FISCAL-YEAR-END,
       STREET1,CITY,STATE,ZIP,PHONE) - O (#PCDATA)>
]>
<ISSUER> ... rest of your input data following here

Crucially, the declarations for CONFORMED-NAME, CIK, and the other field-like elements use - O (hyphen-minus and letter O) as tag omission indicators. The hyphen says the start tag is required; the O tells SGML that the end-element tag can be omitted, in which case the osx program inserts it automatically. The container elements use - - instead, meaning both tags must be present.
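If you have several such file layouts, the DOCTYPE can be generated rather than typed by hand. The following is only a sketch: buildDoctype and prependDoctype are hypothetical helper names, and the field list includes PHONE because it appears in the sample data under BUSINESS-ADDRESS.

```javascript
// Hypothetical helper: builds a DOCTYPE string from a root element name,
// a map of container elements to their content models, and a list of
// field-like elements whose end tags are omitted in the input.
function buildDoctype(root, containers, fields) {
  const lines = [`<!DOCTYPE ${root} [`];
  for (const [name, model] of Object.entries(containers)) {
    // "- -": both the start tag and the end tag must be present.
    lines.push(`  <!ELEMENT ${name} - - (${model.join(',')})>`);
  }
  // "- O": start tag required, end tag may be omitted.
  lines.push(`  <!ELEMENT (${fields.join(',')}) - O (#PCDATA)>`);
  lines.push(']>');
  return lines.join('\n');
}

// Puts the DOCTYPE in front of the raw SGML content.
function prependDoctype(doctype, sgml) {
  return `${doctype}\n${sgml.trimStart()}`;
}

const issuerDoctype = buildDoctype(
  'ISSUER',
  {
    'ISSUER': ['COMPANY-DATA', 'BUSINESS-ADDRESS', 'MAIL-ADDRESS'],
    'COMPANY-DATA': ['CONFORMED-NAME', 'CIK', 'ASSIGNED-SIC', 'IRS-NUMBER',
      'STATE-OF-INCORPORATION', 'FISCAL-YEAR-END'],
    'BUSINESS-ADDRESS': ['STREET1', 'CITY', 'STATE', 'ZIP', 'PHONE?'],
    'MAIL-ADDRESS': ['STREET1', 'CITY', 'STATE', 'ZIP'],
  },
  ['CONFORMED-NAME', 'CIK', 'ASSIGNED-SIC', 'IRS-NUMBER',
    'STATE-OF-INCORPORATION', 'FISCAL-YEAR-END',
    'STREET1', 'CITY', 'STATE', 'ZIP', 'PHONE']
);
```

The string returned by prependDoctype(issuerDoctype, rawFileContents) is what would be piped to osx; rawFileContents here stands for the text of one of your input files.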

You can read more about the meaning of these declarations on my project page at http://sgmljs.net/docs/sgmlrefman.html .

