
python - How is the Tf-Idf value calculated with analyzer='char'?

I'm having a problem understanding how the Tf-Idf values are obtained in the following program:

I have tried calculating the value of 'a' in document 2 ('And_this_is_the_third_one.') using the concept given on the site, but the value I compute for 'a' is

1/26*log(4/1)

((count of occurrences of the 'a' character) / (number of characters in the given document)) * log(# docs / # docs in which the given character occurs)

= 0.023156

But the returned value is 0.2203, as can be seen in the output below.
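For reference, a quick check of that arithmetic in Python (note it assumes a base-10 log, which is what gives 0.023156):

import math

# tf = 1/26 ('a' occurs once among the 26 characters of the document)
# idf = log10(4 docs / 1 doc containing 'a')
print((1 / 26) * math.log10(4 / 1))   # ~0.023156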

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This_is_the_first_document.', 'This_document_is_the_second_document.', 'And_this_is_the_third_one.', 'Is_this_the_first_document?', ]
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char")
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(vectorizer.vocabulary_)
m = X.todense()
print(m)

I expected the output to be 0.023156 using the concept explained above.

The output is:

['.', '?', '_', 'a', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']

{'t': 15, 'h': 8, 'i': 9, 's': 14, '_': 2, 'e': 6, 'f': 7, 'r': 13, 'd': 5, 'o': 12, 'c': 4, 'u': 16, 'm': 10, 'n': 11, '.': 0, 'a': 3, '?': 1}

[[0.14540332 0.         0.47550697 0.         0.14540332 0.11887674
  0.23775349 0.17960203 0.23775349 0.35663023 0.14540332 0.11887674
  0.11887674 0.14540332 0.35663023 0.47550697 0.14540332]
 [0.10814145 0.         0.44206359 0.         0.32442434 0.26523816
  0.35365088 0.         0.17682544 0.17682544 0.21628289 0.26523816
  0.26523816 0.         0.26523816 0.35365088 0.21628289]
 [0.14061506 0.         0.57481012 0.22030066 0.         0.22992405
  0.22992405 0.         0.34488607 0.34488607 0.         0.22992405
  0.11496202 0.14061506 0.22992405 0.34488607 0.        ]
 [0.         0.2243785  0.46836004 0.         0.14321789 0.11709001
  0.23418002 0.17690259 0.23418002 0.35127003 0.14321789 0.11709001
  0.11709001 0.14321789 0.35127003 0.46836004 0.14321789]]


1 Reply


The TfidfVectorizer() adds smoothing to the document counts and applies l2 normalization on top of the tf-idf vectors, as mentioned in the documentation. The per-term score is:

(count of occurrences of the character / number of characters in the given document) *
(log((1 + # docs) / (1 + # docs in which the given character is present)) + 1)

The normalization is l2 by default, but you can change or remove this step with the norm parameter. Similarly, the smoothing can be turned off by passing smooth_idf=False.
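For example, for the character 'a' in the document at index 2 ('And_this_is_the_third_one.', 26 characters), the score before normalization works out roughly like this:

import numpy as np

tf  = 1 / 26                           # 'a' occurs once among the 26 characters
idf = np.log((1 + 4) / (1 + 1)) + 1    # 4 docs in total, 1 doc contains 'a', with smoothing
print(tf * idf)                        # ~0.0737; after l2-normalizing the whole row this becomes ~0.2203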

To understand how the exact score is computed, I am going to fit a CountVectorizer() to get the count of each character in every document.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

countVectorizer = CountVectorizer(analyzer='char')
tf = countVectorizer.fit_transform(corpus)
# counts of each character in every document
tf_df = pd.DataFrame(tf.toarray(),
                     columns=countVectorizer.get_feature_names())
tf_df

#output:
   .  ?  _  a  c  d  e  f  h  i  m  n  o  r  s  t  u
0  1  0  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
1  1  0  5  0  3  3  4  0  2  2  2  3  3  0  3  4  2
2  1  0  5  1  0  2  2  0  3  3  0  2  1  1  2  3  0
3  0  1  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1

Let us now apply the tf-idf weighting, following the sklearn implementation, to the document at index 2 ('And_this_is_the_third_one.')!

v=[]
doc_id = 2
# number of documents in the corpus + smoothing
n_d = 1+ tf_df.shape[0]

for char in tf_df.columns:
    # calculate tf - count of this char in the doc / total number chars in the doc
    tf = tf_df.loc[doc_id,char]/tf_df.loc[doc_id,:].sum()

    # number of documents containing this char with smoothing 
    df_d_t = 1+ sum(tf_df.loc[:,char]>0)
    # now calculate the idf with smoothing 
    idf = (np.log (n_d/df_d_t) + 1 )

    # calculate the score now
    v.append (tf*idf)

from sklearn.preprocessing import normalize

# normalize the vector with l2 norm and create a dataframe with feature_names

pd.DataFrame(normalize([v], norm='l2'), columns=vectorizer.get_feature_names())

#output:

       .    ?        _         a    c         d         e    f         h        i    m         n         o         r         s         t    u  
 0.140615  0.0  0.57481  0.220301  0.0  0.229924  0.229924  0.0  0.344886   0.344886  0.0  0.229924  0.114962  0.140615  0.229924  0.344886  0.0 

You can see that the score for the character 'a' matches the TfidfVectorizer() output!
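As a side note, sklearn internally uses the raw character count as the tf; dividing by the document length above only rescales the row by a constant, which the l2 normalization cancels. If you want to see the unnormalized scores, you could pass norm=None (a quick check, reusing corpus and the TfidfVectorizer import from the question):

X_raw = TfidfVectorizer(min_df=0.0, analyzer='char', norm=None).fit_transform(corpus)
# 'a' is column index 3; for the document at index 2: raw count (1) * idf (~1.9163)
print(X_raw[2, 3])   # ~1.9163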

