
java tika how to convert html to plain text retaining specific element

The code below works perfectly for converting HTML to plain text:

URL url = new URL(your_url);
InputStream is = url.openStream(); 
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(is, textHandler, metadata, context);
System.out.println("Body: " + textHandler.toString());

My question is: how do I retain specific elements, such as links, or prevent them from being removed during the HTML to plain text conversion?

Thanks and best regards...



1 Reply


There are many ways you can use Apache Tika for this kind of work.

As Gagravarr says, the ContentHandler you use is the key here. There are a number of useful ones that could help, or you can build your own custom one.

Since I'm not sure what you are looking to do, I've tried to share some examples of common approaches, particularly for HTML content.

MatchingContentHandler

A common route is to use a MatchingContentHandler to filter the content you are interested in:

URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();

// Only select <a> tags to be output
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher linkMatcher = xhtmlParser.parse("//xhtml:a/descendant::node()");
MatchingContentHandler handler = new MatchingContentHandler(new ToHTMLContentHandler(), linkMatcher);

// Parse based on original question
HtmlParser parser = new HtmlParser();
Metadata metadata = new Metadata();
parser.parse(is, handler, metadata, new ParseContext());
System.out.println("Links: " + handler.toString());

It's worth noting this is for inclusion only and supports only a subset of XPath. See XPathParser for details.

LinkContentHandler

If you just want to extract links, the LinkContentHandler is a great option:

URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, linkHandler, metadata, new ParseContext());
System.out.println("Links: " + linkHandler.getLinks());

Its code is also a great example of how to build a custom handler.
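
If you then want to work with the individual links, the returned Link objects expose the URI and the anchor text. Here's a quick sketch; the getUri(), getText() and isAnchor() accessors are on Tika's org.apache.tika.sax.Link class, but double-check them against the Javadoc for your Tika version:

for (Link link : linkHandler.getLinks()) {
    // isAnchor() distinguishes <a> links from images, scripts, etc.
    if (link.isAnchor()) {
        System.out.println(link.getText() + " -> " + link.getUri());
    }
}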

BoilerpipeContentHandler

The BoilerpipeContentHandler uses the Boilerpipe library underneath, allowing you to use one of its predefined extractors to process the content.

URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();

ExtractorBase extractor = ArticleSentencesExtractor.getInstance();
BoilerpipeContentHandler textHandler = new BoilerpipeContentHandler(new BodyContentHandler(), extractor);

Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, textHandler, metadata, new ParseContext());

System.out.println(textHandler.getTextDocument().getTextBlocks());

These can be really useful if you are mainly interested in the main content of a page, as the extractors help you focus on it.
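
If you only want the blocks Boilerpipe classified as main content, you can filter the returned TextBlock list. A rough sketch; isContent() and getText() come from Boilerpipe's TextBlock API, so verify them against the Boilerpipe version on your classpath:

for (TextBlock block : textHandler.getTextDocument().getTextBlocks()) {
    // isContent() is true for blocks the extractor kept as main content
    if (block.isContent()) {
        System.out.println(block.getText());
    }
}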

Custom ContentHandler or ContentHandlerDecorator

You can build your own ContentHandler to do custom processing and get exactly what you want out from a file.

In some cases that means writing out specific content, as in the example below; in other cases it means extra processing, such as collecting links and making them available, as LinkContentHandler does.

Using custom ContentHandler instances is really powerful, and there are plenty of examples in the Apache Tika code base, as well as in other open source projects.

Below is a slightly contrived example that emits only part of the HTML (the <h2> headings):

URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();

StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.setResult(new StreamResult(sw));

// Decorator that forwards SAX events only for <h2> elements and their text
ContentHandlerDecorator h2Handler = new ContentHandlerDecorator(handler) {
    private final List<String> elementsToInclude = List.of("h2");
    private boolean processElement = false;

    @Override
    public void startElement(String uri, String local, String name, Attributes atts)
            throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = true;
            super.startElement(uri, local, name, atts);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Skip whitespace
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!processElement) {
            return;
        }
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(
            String uri, String local, String name) throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = false;
            super.endElement(uri, local, name);
        }
    }
};

HtmlParser parser = new HtmlParser();
parser.parse(is, h2Handler, new Metadata(), new ParseContext());
System.out.println("Heading Level 2s: " + sw.toString());

As you can see from some of these examples, one or more ContentHandler instances can be chained together, but it's worth noting that some expect well-formed output, so check the Javadocs. XHTMLContentHandler is also useful if you want to map different file types into your own common format.
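
For the original question, getting plain text while still keeping the links, one simple way to chain handlers is Tika's TeeContentHandler, which feeds the same SAX events to several handlers in a single parse. This is just a sketch combining the handlers already shown above:

URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();

// One parse, two outputs: plain text from the BodyContentHandler,
// the link list from the LinkContentHandler.
BodyContentHandler bodyHandler = new BodyContentHandler();
LinkContentHandler linkHandler = new LinkContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(bodyHandler, linkHandler);

HtmlParser parser = new HtmlParser();
parser.parse(is, teeHandler, new Metadata(), new ParseContext());

System.out.println("Body: " + bodyHandler.toString());
System.out.println("Links: " + linkHandler.getLinks());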

JSoup :)

Another route is JSoup: either use it directly (skip the Tika part and use Jsoup.connect()) if you are processing HTML, or chain it with Apache Tika if you want to work with the HTML that Tika generates from other file types.

URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ToHTMLContentHandler html = new ToHTMLContentHandler();
HtmlParser parser = new HtmlParser();
parser.parse(is, html, new Metadata(), new ParseContext());
Document doc = Jsoup.parse(html.toString());
Elements h2List = doc.select("h2");
for (Element headline : h2List) {
    System.out.println(headline.text());
}   

Once you've parsed it, you can query the document with Jsoup. It's not as efficient as a ContentHandler built for the job, but it can be useful for messy content sets.
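
If your input is plain HTML anyway, the direct route mentioned above skips Tika entirely. A minimal sketch using Jsoup.connect():

// Fetch and parse the page directly with Jsoup, no Tika involved
Document doc = Jsoup.connect("http://tika.apache.org").get();
for (Element link : doc.select("a[href]")) {
    // "abs:href" resolves relative URLs against the page's base URI
    System.out.println(link.text() + " -> " + link.attr("abs:href"));
}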

