Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


PHP - How would you code an anti-plagiarism site?

First, please note that I am interested in how something like this would work. I am not intending to build it for a client, as I'm sure there are already open source implementations.

How do the algorithms that detect plagiarism in uploaded text work? Do they use regex to send all words to an index, strip out common words like 'the', 'a', etc., and then see how many words are the same in different essays? Do they then have a magic number of identical words that flags an essay as a possible duplicate? Do they use levenshtein()?

My language of choice is PHP.

UPDATE

I'm thinking of not checking for plagiarism globally, but rather within, say, 30 uploaded essays from a class, in case students have worked together on a strictly one-person assignment.

Here is an online site that claims to do so: http://www.plagiarism.org/



1 Reply


Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).

However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). A text's Kolmogorov complexity is not computable, so you cannot calculate it exactly, but you can approximate it by simply compressing the text.
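For reference, the NCD is usually written as follows. Here C(s) stands for the compressed length of s under whichever compressor you pick (an approximation of its Kolmogorov complexity), and xy is x concatenated with y:

```latex
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```

With an ideal compressor the value would be near 0 for identical inputs and near 1 for unrelated ones. Real compressors add some overhead, so treat the numbers as relative scores rather than exact distances.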

A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily, PHP supports several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:

PHP:

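// Normalized Compression Distance: lower values mean more similar inputs.
function ncd($x, $y) {
  $cx = strlen(gzcompress($x));        // compressed size of $x alone
  $cy = strlen(gzcompress($y));        // compressed size of $y alone
  $cxy = strlen(gzcompress($x . $y));  // compressed size of the concatenation
  return ($cxy - min($cx, $cy)) / max($cx, $cy);
}

print(ncd('this is a test', 'this was a test') . PHP_EOL);
print(ncd('this is a test', 'this text is completely different') . PHP_EOL);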

Python:

>>> from zlib import compress as c
>>> def ncd(x, y):
...     # zlib works on bytes, so encode str inputs first (Python 3)
...     x, y = x.encode(), y.encode()
...     cx, cy = len(c(x)), len(c(y))
...     return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
... 
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.74358974358974361

Note that for larger texts (read: actual files) the results will be much more pronounced. Give it a try and report your experiences!
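For the classroom case from the question (about 30 essays, every one checked against every other), a minimal sketch could look like the following. It reuses the same Zlib-based NCD as above. The `rank_pairs` helper and the sample essay names are made up for illustration, and any cut-off for "suspicious" is something you would have to tune on your own data.

```python
import zlib
from itertools import combinations


def ncd(x, y):
    """Normalized Compression Distance between two byte strings (zlib)."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_pairs(essays):
    """Score every pair of essays; the most similar pair (lowest NCD) comes first.

    `essays` maps a label (e.g. a student name) to the essay text.
    Returns a sorted list of (score, label_a, label_b) tuples.
    """
    encoded = {name: text.encode("utf-8") for name, text in essays.items()}
    scores = [
        (ncd(encoded[a], encoded[b]), a, b)
        for a, b in combinations(sorted(encoded), 2)
    ]
    return sorted(scores)


if __name__ == "__main__":
    # Hypothetical sample data; in practice you would read the uploaded files.
    essays = {
        "alice": "The industrial revolution moved most work from farms into factories.",
        "bob": "The industrial revolution moved most labour from farms into factories.",
        "carol": "Photosynthesis lets green plants turn sunlight into sugar and oxygen.",
    }
    for score, a, b in rank_pairs(essays):
        print(f"{a} vs {b}: {score:.3f}")
```

With 30 essays this is only 435 comparisons, so the brute-force pairing is cheap. One caveat: Deflate (the algorithm behind Zlib) only looks back 32 KiB, so for much longer documents a compressor that looks further back, such as `bz2` or `lzma` from the standard library, may separate pairs better.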

