Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Optimizing Java crawler code

package com.company;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class MingLuSpider {
    // Fetches the listing page at url, then fetches and prints every company page it links to.
    public void GetRequestData(String url) throws IOException {
        // try-with-resources closes the client and every response, even when a request fails.
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet httpGet = new HttpGet(url);
            httpGet.setHeader("User-Agent", "Mozilla/5.0(Windows NT 6.1;Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
            String responseBody;
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity httpEntity = response.getEntity();
                responseBody = EntityUtils.toString(httpEntity, "utf-8");
            }
            // Each company link sits in a table cell with these two CSS classes.
            Document document = Jsoup.parse(responseBody);
            Elements getItems = document.select("td[class='views-field views-field-name']");
            for (Element getItem : getItems) {
                String link = "https://gongshang.mingluji.com" + getItem.select("a").attr("href");
                System.out.println("Company link: " + link);
                // A new HttpGet with the same header is created for every link.
                HttpGet getInsideData = new HttpGet(link);
                getInsideData.setHeader("User-Agent", "Mozilla/5.0(Windows NT 6.1;Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
                try (CloseableHttpResponse responseInside = httpClient.execute(getInsideData)) {
                    HttpEntity httpInsideEntity = responseInside.getEntity();
                    String responseInsideBody = EntityUtils.toString(httpInsideEntity, "utf-8");
                    System.out.println(responseInsideBody);
                }
                System.out.println("This link is:");
                System.out.println(link);
            }
        }
    }
}

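Here, every request for a new URL needs its own new HttpGet(link), and the same header has to be set on it each time. Is there a way to avoid creating a new object for each request and to set the header only once?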

1 Reply

Try fluent-hc. It is the official fluent wrapper around HttpClient and is much more convenient to use than plain HttpClient.
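As a rough illustration of that suggestion, here is a minimal sketch. It assumes HttpClient 4.x with the fluent-hc artifact on the classpath. The class name FluentSpiderSketch and the helpers newClient and fetch are hypothetical names made up for this example. The User-Agent is set once on the client, and a fluent Executor reuses that client for every request.

```java
import org.apache.http.client.fluent.Executor;
import org.apache.http.client.fluent.Request;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch class; not part of the original question.
public class FluentSpiderSketch {
    // Same User-Agent string as in the question, now defined in one place.
    static final String USER_AGENT =
            "Mozilla/5.0(Windows NT 6.1;Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0";

    // Builds a client that adds USER_AGENT to any request that does not set its own.
    static CloseableHttpClient newClient() {
        return HttpClients.custom().setUserAgent(USER_AGENT).build();
    }

    // Fetches one URL and decodes the body as UTF-8, as the original code does.
    // returnContent() throws an exception for response status codes of 300 and above.
    static String fetch(Executor executor, String url) throws IOException {
        byte[] body = executor.execute(Request.Get(url)).returnContent().asBytes();
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // URLs are taken from the command line; the client is closed when the loop ends.
        try (CloseableHttpClient client = newClient()) {
            Executor executor = Executor.newInstance(client);
            for (String url : args) {
                System.out.println(fetch(executor, url));
            }
        }
    }
}
```

In the crawler above, the Jsoup loop could stay as it is and call fetch(executor, link) for each company link. A request object is still created per URL, because each one represents a single request, but that is cheap and the header no longer has to be repeated. With plain HttpClient and no fluent-hc, building the client through HttpClients.custom().setUserAgent(...) should give the same set-once behavior.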