python - List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

Question


posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see every term and its total frequency in the text corpus, in order to select stop words. For example:

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there any built-in function for this?


1 Reply

深蓝 · Answer 1 · 2021-10-23T18:43:59+0000

If cv is your CountVectorizer and X is the vectorized corpus, then

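import numpy as np
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
list(zip(cv.get_feature_names_out(),
         np.asarray(X.sum(axis=0)).ravel()))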

returns a list of (term, frequency) pairs for each distinct term in the corpus that the CountVectorizer extracted.

(The asarray + ravel step is needed because summing a scipy.sparse matrix along an axis returns a 1×n numpy.matrix rather than a flat array.)
