Nutch does not know whether a page is already in the index. In order to keep the index and the crawled content in sync,
- successfully fetched pages are sent to the index and counted as additions or updates
- (with indexer option -deleteGone) 404s and otherwise failed fetches are deleted from the index and counted as "gone"
- redirects are handled the same way but counted separately as "redirects"
And is it possible to find out which pages are gone and which are redirects?
You can use the Nutch tools
- readdb to dump the CrawlDb
- readseg to dump the segment which was indexed
and then search the dumps for 404s, fetch failures, redirects, etc. Calling bin/nutch readdb or bin/nutch readseg without further arguments will show you all available command-line options.
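Once you have a plain-text CrawlDb dump, the "search for 404s and redirects" step can be scripted. The following is only a rough sketch: it assumes each dump record starts with a line of the form "URL<TAB>Version: N" followed by a line like "Status: 3 (db_gone)", which is how the dumps I have seen look, so check it against the output of your Nutch version. The function name urls_by_status and the sample data are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout of a record in a plain-text CrawlDb dump (verify for your version):
#   http://example.com/removed<TAB>Version: 7
#   Status: 3 (db_gone)
URL_LINE = re.compile(r"^(\S+)\tVersion: \d+")
STATUS_LINE = re.compile(r"^Status: \d+ \((\w+)\)")


def urls_by_status(lines):
    """Group URLs from a CrawlDb text dump by their status name."""
    groups = defaultdict(list)
    current_url = None
    for line in lines:
        line = line.rstrip("\n")
        url_match = URL_LINE.match(line)
        if url_match:
            # Start of a new record: remember the URL until its status line shows up.
            current_url = url_match.group(1)
            continue
        status_match = STATUS_LINE.match(line)
        if status_match and current_url is not None:
            groups[status_match.group(1)].append(current_url)
            current_url = None
    return dict(groups)


# Hypothetical sample data in the assumed format; lines that are neither a
# URL line nor a status line are ignored.
SAMPLE = """\
http://example.com/ok\tVersion: 7
Status: 2 (db_fetched)
Fetch time: (not parsed)

http://example.com/removed\tVersion: 7
Status: 3 (db_gone)

http://example.com/moved\tVersion: 7
Status: 5 (db_redir_perm)
"""

if __name__ == "__main__":
    for status, urls in urls_by_status(SAMPLE.splitlines()).items():
        print(status, urls)
```

For a real dump you would pass an open file instead of the sample, e.g. urls_by_status(open(path_to_dump_file)), and then look at the db_gone, db_redir_temp and db_redir_perm groups.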