algorithm - How to detect duplicates among text documents and return the duplicates' similarity?

Question

Welcome To Ask or Share your Answers For Others

algorithm - How to detect duplicates among text documents and return the duplicates' similarity?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

algorithm - How to detect duplicates among text documents and return the duplicates' similarity?

I'm writing a crawler to get content from some website, but the content can duplicated, I want to avoid that. So I need a function can return the same percent between two text to detect two content maybe duplicated Example:

Text 1:"I'm writing a crawler to"
Text 2:"I'm writing a some text crawler to get"

The compare function will return text 2 as the same text 1 by 5/8%(with 5 is words number of text 2 same text 1(compare by word order), and 8 is total words of text 2). If remove the "some text" then text 2 as the same text 1(I need detect the situation).How can I do that?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:50:16+0000

You are facing a problem which is known in the field of Information Retrieval as Near Duplicates Detection.

One of the known solutions to it is to use Jaccard-Similarity for getting the difference between two documents.

Jaccard Similarity is basically - get sets of words from each document, let these sets be s1 and s2 - and the jaccard similarity is |s1 [intersection] s2|/|s1 [union] s2|.

Usually when facing near duplicates - the order of words has some importance however. In order to deal with it - when generating the sets s1 and s2 - you actually generate sets of k-shinglings, instead sets of only words.
In your example, with k=2, the sets will be:

s1 = { I'm write, write a, a crawler, crawler to }
s2 = { I'm write, write a, a some, some text, text crawler, crawler to, to get }
s1 [union] s2 = { I'm write, write a, a crawler, crawler to, a some, some text, text crawler, to get } 
s1 [intersection] s2 = { I'm write, write a, crawler to }

In the above, the jaccard-similarity will be 3/8. If you use single words with the same approach, (k=1 shinglings) you will get your desired 5/8 - but this is worse solution in my (and most IR experts) opinion.

This procedure can be scaled nicely to deal very efficiently with huge collections, without checking all pairs and creating huge numbers of sets. More details could be found in these lecture notes (I gave this lecture few months ago, based on the author's notes).

Categories

algorithm - How to detect duplicates among text documents and return the duplicates' similarity?

algorithm - How to detect duplicates among text documents and return the duplicates' similarity?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags