Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
474 views
in Technique[技术] by (71.8m points)

information retrieval - Why is log used when calculating term frequency weight and IDF, inverse document frequency?

The formula for IDF is log( N / df t ) instead of just N / df t.

Where N = total documents in collection, and df t = document frequency of term t.

Log is said to be used because it “dampens” the effect of IDF. What does this mean?

Also, why do we use log frequency weighing for term frequency as seen here:

enter image description here

question from:https://stackoverflow.com/questions/27067992/why-is-log-used-when-calculating-term-frequency-weight-and-idf-inverse-document

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Debasis's answer is correct. I am not sure why he got downvoted.

Here is the intuition: If term frequency for the word 'computer' in doc1 is 10 and in doc2 it's 20, we can say that doc2 is more relevant than doc1 for the word 'computer.

However, if the term frequency of the same word, 'computer', for doc1 is 1 million and doc2 is 2 millions, at this point, there is no much difference in terms of relevancy anymore because they both contain a very high count for term 'computer'.

Just like Debasis's answer, adding log is to dampen the importance of term that has a high frequency, e.g. Using log base 2, the count of 1 million will be reduced to 19.9!

We also add 1 to the log(tf) because when tf is equal to 1, the log(1) is zero. By adding one, we distinguish between tf=0 and tf=1.

Hope this helps!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...