Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c# - How can I bring Google-like recrawling to my application (web or console)?

How can I bring Google-like recrawling to my application (web or console)? I need to recrawl only those pages that were updated after a particular date.

The LastModified value in System.Net.HttpWebResponse gives only the current date of the server. For example, if I download a page with HttpWebRequest on 27 January 2012 and check the LastModified date, it shows the server's current time when the page was served, which in this case is simply 27 January 2012.

Can anyone suggest any other methods?



1 Reply


First, what you're trying to do is very difficult, and there are many research-level papers that try to address it (I will give you links to a few of them a little later). There is no way to tell whether a site has changed without crawling it, although you can take shortcuts such as checking the Content-Length in the response headers without downloading the rest of the page. This lets your system save on traffic, but it won't solve your problem in a really useful way, because a page can change while keeping the same length.
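To make the Content-Length shortcut concrete, here is a minimal sketch in Python (the same logic carries over to HttpWebRequest with a HEAD method). The function names `parse_content_length`, `may_have_changed`, and `fetch_headers` are hypothetical, chosen only for this illustration. It assumes the server answers HEAD requests and reports Content-Length, which many servers do not for dynamically generated pages.

```python
import urllib.request
from typing import Mapping, Optional


def parse_content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length as an int, or None if missing or malformed."""
    for name, value in headers.items():
        if name.lower() == "content-length":  # header names are case-insensitive
            try:
                length = int(value.strip())
            except ValueError:
                return None
            return length if length >= 0 else None
    return None


def may_have_changed(previous_length: Optional[int],
                     headers: Mapping[str, str]) -> bool:
    """Cheap pre-check: True means 'download the page and compare it'.

    A differing length implies a change. An equal length proves nothing,
    so treat False only as a hint, never as a guarantee.
    """
    current = parse_content_length(headers)
    if previous_length is None or current is None:
        return True  # nothing to compare against, so assume it changed
    return current != previous_length


def fetch_headers(url: str, timeout: float = 10.0) -> Mapping[str, str]:
    """Issue a HEAD request so that only the headers are transferred."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

A crawler would store the length seen on the previous visit and call `may_have_changed(stored_length, fetch_headers(url))` before deciding whether to download the body.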

Second, since you're concerned about content, the Last-Modified header field will not be very useful to you; I would even go as far as to say that it will not be useful at all.

And third, what you're describing has somewhat conflicting requirements: you're interested in crawling only the pages that have updated content, and that's not exactly how Google does things (yet you want Google-like crawling). Google's crawling is focused on providing the freshest content for the most frequently searched and visited websites. For example, Google has very little interest in frequently crawling a website that updates its content twice a day when that website has 10 visitors a day; it is more interested in crawling a website that gets 10 million visitors a day, even if its content updates less frequently. It may also be true that websites that update their content frequently have a lot of visitors, but from Google's perspective that's not exactly relevant.


If you have to discover new websites (coverage) and at the same time want the latest content of the sites you already know about (freshness), then you have conflicting goals (which is true for most crawlers, even Google's). Usually, more coverage means less freshness, and more freshness means less coverage. If you're interested in balancing both, then I suggest you read the following articles:

The summary of the idea is that you have to crawl a website several times (maybe several hundred times) to build up a good measure of its history. Once you have a good set of historical measurements, you use a predictive model to estimate when the website will change again, and you schedule a crawl for some time after the expected change.
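The idea above can be sketched roughly as follows. This is a deliberately naive model, not the method from any particular paper: it assumes changes are detected by comparing content hashes between crawls, and it takes the average time between detected changes as the prediction. The names `Observation`, `estimate_change_interval`, and `next_crawl_time`, along with the `margin` and `fallback_interval` parameters, are hypothetical. Note that this estimate tends to understate how often a page changes, because several changes between two crawls are counted as one.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    crawled_at: float   # time of the crawl, in seconds since the epoch
    content_hash: str   # fingerprint of the page body at that time


def content_hash(body: bytes) -> str:
    """Fingerprint a page body so two crawls can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def estimate_change_interval(history: List[Observation]) -> Optional[float]:
    """Average seconds between detected changes, or None if none were seen."""
    ordered = sorted(history, key=lambda o: o.crawled_at)
    changes = sum(
        1 for prev, cur in zip(ordered, ordered[1:])
        if prev.content_hash != cur.content_hash
    )
    if changes == 0:
        return None
    observed_span = ordered[-1].crawled_at - ordered[0].crawled_at
    return observed_span / changes


def next_crawl_time(history: List[Observation],
                    margin: float = 0.1,
                    fallback_interval: float = 86400.0) -> float:
    """Schedule the next crawl shortly after the next expected change."""
    if not history:
        raise ValueError("need at least one observation")
    ordered = sorted(history, key=lambda o: o.crawled_at)
    last_crawl = ordered[-1].crawled_at
    interval = estimate_change_interval(ordered)
    if interval is None:
        # No change seen yet: fall back to a fixed revisit period.
        return last_crawl + fallback_interval
    # Time of the crawl at which the most recent change was detected.
    last_change = ordered[0].crawled_at
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.content_hash != cur.content_hash:
            last_change = cur.crawled_at
    scheduled = last_change + interval + margin * interval
    # Never schedule in the past relative to the latest crawl.
    return max(scheduled, last_crawl)
```

With only a handful of observations the estimate is very noisy, which is why the history needs many crawls before the schedule becomes trustworthy.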

