information retrieval - Why is log used when calculating term frequency weight and IDF, inverse document frequency?

Question

1 Reply

深蓝 · Answer 1 · 2021-10-06T05:23:29+0000

Debasis's answer is correct. I am not sure why he got downvoted.

Here is the intuition: if the term frequency for the word 'computer' is 10 in doc1 and 20 in doc2, we can say that doc2 is more relevant than doc1 for the word 'computer'.

However, if the term frequency of the same word, 'computer', is 1 million in doc1 and 2 million in doc2, there is not much difference in relevance anymore, because both documents contain a very high count of the term 'computer'.

As Debasis's answer says, taking the log dampens the importance of a term that has a very high frequency. For example, using log base 2, a count of 1 million is reduced to about 19.9!
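To make the dampening concrete, here is a small sketch that applies log base 2 to the counts from the example above. The function name `damped` is just something I made up for illustration; it is not from any library.

```python
import math

def damped(tf):
    """Log base 2 of a raw term count (hypothetical helper; tf must be > 0)."""
    return math.log2(tf)

# The two document pairs from the example: (doc1 count, doc2 count).
for tf1, tf2 in [(10, 20), (1_000_000, 2_000_000)]:
    d1, d2 = damped(tf1), damped(tf2)
    # The raw ratio is 2 in both cases; the damped ratio shrinks as counts grow.
    print(f"raw: {tf1} vs {tf2} | log2: {d1:.2f} vs {d2:.2f} | damped ratio: {d2 / d1:.2f}")
```

In both pairs doc2 has twice the raw count, but after the log the gap is a constant 1.0, so relative to the size of the scores it matters much less for the large counts than for the small ones.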

We also add 1 to log(tf) because when tf is equal to 1, log(1) is zero, so a term that occurs once would get the same weight as a term that does not occur at all. By adding one, we distinguish between tf=0 and tf=1.
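A minimal sketch of that weighting, assuming the usual convention that a term which does not appear at all gets weight 0 (the log is undefined there). The name `tf_weight` and the default base of 2 are my own choices to match the example above; other bases such as 10 are also commonly used.

```python
import math

def tf_weight(tf, base=2):
    """Sublinear term-frequency weight (hypothetical helper).

    Assumption: weight is 0 when the term is absent, otherwise 1 + log(tf).
    """
    if tf <= 0:
        return 0.0
    return 1.0 + math.log(tf, base)

for tf in [0, 1, 10, 20, 1_000_000]:
    print(tf, round(tf_weight(tf), 2))
```

With this, tf=0 gives 0 and tf=1 gives 1, so a single occurrence is no longer indistinguishable from no occurrence.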

Hope this helps!
