Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

jsoup HTML fragment detection

I'm parsing an HTML fragment with the jsoup HTML parser, without knowing in advance that the input is a fragment. For example:

    String html = "<script>document.location = \"http://example.com/\";</script>";
    Document document = Jsoup.parse(html);
    System.out.println(document.html());

Output:

<html>
   <head>
     <script>document.location = "http://example.com/";</script>
   </head>
  <body></body>
</html>

Question: Is there a way to know that the <html>, <head> and <body> tags were added by jsoup and were not in the original HTML fragment?

Update:

I also tried enabling error tracking:

    Parser parser = Parser.htmlParser();
    parser.setTrackErrors(500);
    Document document = parser.parseInput(html, "example.com");
    ParseErrorList errors = parser.getErrors();

But the resulting list of errors is empty.



1 Reply

The simplest way to do this is to parse the input twice, once as XML and once as HTML, and compare the element counts of the two results. The XML parser does not add elements, whereas the HTML parser inserts missing optional tags and performs other normalization.

Here's an example:

@Test public void detectAutoElements() {
    String bare = "<script>One</script>";
    String full =
       "<html><head><title>Check</title></head><body><p>One</p></body></html>";

    assertTrue(didAddElements(bare));
    assertFalse(didAddElements(full));
}

private boolean didAddElements(String input) {
    // two passes, one as XML and one as HTML. XML does not vivify missing/optional tags
    Document html = Jsoup.parse(input);
    Document xml = Jsoup.parse(input, "", Parser.xmlParser());

    int htmlElementCount = html.getAllElements().size();
    int xmlElementCount = xml.getAllElements().size();
    boolean added = htmlElementCount > xmlElementCount;

    System.out.printf(
      "Original input has %s elements; HTML doc has %s. Is a fragment? %s\n",
      xmlElementCount, htmlElementCount, added);

    return added;
}

This gives the result:

Original input has 2 elements; HTML doc has 5. Is a fragment? true
Original input has 6 elements; HTML doc has 6. Is a fragment? false

Depending on your needs, you could extend this to compare the two document structures in more depth.
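
One possible form of such a deeper comparison, as a minimal sketch rather than a definitive implementation: instead of comparing only the counts, list the tag names that appear in the HTML parse but not in the XML parse. This shows which elements were added, which matters because the HTML parser can insert more than <html>, <head> and <body> (for example, a <tbody> inside a table that lacks one). The class name `AddedTags` and the method name `addedTags` are hypothetical names introduced for illustration; the jsoup calls are the ones used above, plus `Element.tagName()`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AddedTags {

    /**
     * Returns the names of elements present in the HTML parse of the input
     * but absent from its XML parse, i.e. the elements the HTML parser added.
     * Hypothetical helper, not part of jsoup.
     */
    static List<String> addedTags(String input) {
        Document html = Jsoup.parse(input);
        Document xml = Jsoup.parse(input, "", Parser.xmlParser());

        // Start with every element name from the HTML parse, in document order.
        // Names are lower-cased so differently cased input tags still match.
        List<String> remaining = new ArrayList<>();
        for (Element el : html.getAllElements()) {
            remaining.add(el.tagName().toLowerCase(Locale.ROOT));
        }

        // Remove one occurrence for each element that the XML parse also found.
        // What is left over was introduced by the HTML parser.
        for (Element el : xml.getAllElements()) {
            remaining.remove(el.tagName().toLowerCase(Locale.ROOT));
        }
        return remaining;
    }
}
```

With the two inputs from the test above, `addedTags("<script>One</script>")` returns `[html, head, body]`, and the full document returns an empty list. Like the count comparison, this only matches elements by name, so it does not detect elements that the HTML parser moved to a different place in the tree.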

