Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
279 views
in Technique[技术] by (71.8m points)

python - FreqDist with NLTK

NLTK in python has a function FreqDist which gives you the frequency of words within a text. I am trying to pass my text as an argument but the result is of the form:

[' ', 'e', 'a', 'o', 'n', 'i', 't', 'r', 's', 'l', 'd', 'h', 'c', 'y', 'b', 'u', 'g', ' ', 'm', 'p', 'w', 'f', ',', 'v', '.', "'", 'k', 'B', '"', 'M', 'H', '9', 'C', '-', 'N', 'S', '1', 'A', 'G', 'P', 'T', 'W', '[', ']', '(', ')', '0', '7', 'E', 'J', 'O', 'R', 'j', 'x']

whereas in the example in the NLTK website the result was whole words not just letters. Im doing it this way:

file_y = open(fileurl)
p = file_y.read()
fdist = FreqDist(p)
vocab = fdist.keys()
vocab[:100]

DO you know what I have wrong pls? Thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

FreqDist expects an iterable of tokens. A string is iterable --- the iterator yields every character.

Pass your text to a tokenizer first, and pass the tokens to FreqDist.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...