Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
356 views
in Technique[技术] by (71.8m points)

Text Extraction from HTML Java

I'm working on a program that downloads HTML pages and then selects some of the information and write it to another file.

I want to extract the information which is intbetween the paragraph tags, but i can only get one line of the paragraph. My code is as follows;

FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter(newFile.txt));
String s;

while ((s = br.readLine()) !=null) {
    if(s.contains("<p>")) {
        try {
            out.write(s);
        } catch (IOException e) {
        }
    }
}

i was trying to add another while loop, which would tell the program to keep writing to file until the line contains the </p> tag, by saying;

while ((s = br.readLine()) !=null) {
    if(s.contains("<p>")) {
        while(!s.contains("</p>") {
            try {
                out.write(s);
            } catch (IOException e) {
            }
        }
    }
}

But this doesn't work. Could someone please help.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

jsoup

Another html parser I really liked using was jsoup. You could get all the <p> elements in 2 lines of code.

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements ps = doc.select("p");

Then write it out to a file in one more line

out.write(ps.text());  //it will append all of the p elements together in one long string

or if you want them on separate lines you can iterate through the elements and write them out separately.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...