parsing - Nutch/Elastic Search terms definition

posted Oct 7, 2021 in Technique by 深蓝 (71.8m points)

I used Nutch and Elasticsearch to crawl and parse 99 websites/links and index them in Elasticsearch so that I can search them. All 99 websites/links were crawled, but the job ends with the message below. What do "redirects" and "add/update" mean? And is it possible to find out which pages are gone and which are redirects?

Indexer: number of documents indexed, deleted, or skipped:
Indexer:      5  deleted (gone)
Indexer:      8  deleted (redirects)
Indexer:     76  indexed (add/update)
Indexer: finished at 2020-12-17 13:07:19, elapsed: 00:00:08

question from:https://stackoverflow.com/questions/65599557/nutch-elastic-search-terms-definition


1 Reply

深蓝 · Answer 1 · 2021-10-06T18:50:43+0000

Nutch does not know whether a page is already in the index. To keep the index in sync with the crawled content:

- Successfully fetched pages are sent to the index and counted as "add/update".
- With the indexer option -deleteGone, 404s and other failed fetches are deleted from the index and counted as "gone".
- Redirects are handled the same way but counted separately as "redirects".
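
The counting rules above can be sketched in a few lines of Python. This is only an illustration of the description in this answer, not Nutch's actual code: the status labels "fetched", "gone", and "redirect" and the function names are made up for the example, and the fallback "not sent to the index" for everything else is an assumption.

```python
from collections import Counter


def indexer_action(status: str, delete_gone: bool) -> str:
    """Map a simplified page status to the counter it would land in.

    The status labels are illustrative, not Nutch identifiers.
    """
    if status == "fetched":
        return "indexed (add/update)"
    if status == "gone" and delete_gone:
        return "deleted (gone)"
    if status == "redirect" and delete_gone:
        return "deleted (redirects)"
    # Assumption: anything else is simply not sent to the index.
    return "not sent to the index"


def summarize(statuses, delete_gone=True):
    """Count how many pages fall into each indexer category."""
    return Counter(indexer_action(s, delete_gone) for s in statuses)
```

Feeding it 76 "fetched", 5 "gone", and 8 "redirect" entries reproduces the three counts from the indexer message in the question.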

And is it possible to find out which pages are gone and which are redirects?

You can use the Nutch tools

- readdb to dump the CrawlDb
- readseg to dump the segment that was indexed

and then search the dumps for 404s, fetch failures, redirects, and so on. Running bin/nutch readdb or bin/nutch readseg without arguments prints all available command-line options.
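
Once the CrawlDb has been dumped as text (for example with bin/nutch readdb <crawldb> -dump <output_dir>), the search can be scripted. The following is a minimal sketch, assuming each record in the dump starts with a line of the form "URL<TAB>Version: N" followed by a line such as "Status: 3 (db_gone)". That layout may differ between Nutch versions, so check it against your own dump first. The function name urls_by_status is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "Status: 3 (db_gone)" and captures the status name.
STATUS_LINE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)")


def urls_by_status(dump_text: str) -> dict:
    """Group URLs from a plain-text CrawlDb dump by their status name.

    Assumes each record begins with "<url>\\tVersion: <n>" and has a
    "Status: <n> (<name>)" line before the next record starts.
    """
    result = defaultdict(list)
    current_url = None
    for line in dump_text.splitlines():
        if "\tVersion:" in line:
            # First line of a record: the URL is everything before the tab.
            current_url = line.split("\t", 1)[0]
            continue
        match = STATUS_LINE.match(line)
        if match and current_url is not None:
            result[match.group(1)].append(current_url)
            current_url = None
    return dict(result)
```

Read the text of the dump file(s) in the output directory, pass it to urls_by_status, and look at the entries whose status name contains "gone" or "redir" to see which URLs were deleted from the index.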
