Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to prevent splitting specific words or phrases and numbers in NLTK?

When I tokenize text with NLTK, specific phrases, dates and numbers get split apart, which breaks my text matching. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from being split during word tokenization?

They should not result in:

['runs','in','my','family','4x','a','day']

For example:

Yes 20-30 minutes a day on my bike, it works great!!

gives:

['yes','20-30','minutes','a','day','on','my','bike',',','it','works','great']

I want '20-30 minutes' to be treated as a single token. How can I get this behavior?


1 Reply


To my knowledge, you will be hard pressed to preserve n-grams of various lengths at the same time as tokenizing, but you can find these n-grams first with nltk.ngrams. Then you can replace the ones you want to keep in the corpus with a version joined by some character, like dashes.

This is an example solution, but there are probably lots of ways to get there. Important note: the code finds n-grams that are common in the text. You will probably want more than one, so the variable n_most_common lets you decide how many to collect. You might want a different number for each n-gram length, but I only provide one variable for now. This may miss n-grams you find important. For those, add the ones you want to user_grams and they will be included in the search.

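import nltk
#word_tokenize needs the Punkt models; if they are missing, run
#nltk.download('punkt') once (newer NLTK releases ask for 'punkt_tab')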

#an example corpus
corpus='''A big tantrum runs in my family 4x a day, every week. 
A big tantrum is lame. A big tantrum causes strife. It runs in my family 
because of our complicated history. Every week is a lot though. Every week
I dread the tantrum. Every week...Here is another ngram I like a lot'''.lower()

#tokenize the corpus
corpus_tokens = nltk.word_tokenize(corpus)

#create ngrams from n=2 to 5
bigrams = list(nltk.ngrams(corpus_tokens,2))
trigrams = list(nltk.ngrams(corpus_tokens,3))
fourgrams = list(nltk.ngrams(corpus_tokens,4))
fivegrams = list(nltk.ngrams(corpus_tokens,5))
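
To see what nltk.ngrams yields, here is a rough pure-Python equivalent: a sliding window of n consecutive tokens. The helper name sliding_ngrams and the sample sentence are made up for this illustration and are not part of NLTK.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so incomplete windows are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

sample_tokens = "it runs in my family".split()
sample_bigrams = sliding_ngrams(sample_tokens, 2)
sample_trigrams = sliding_ngrams(sample_tokens, 3)

print(sample_bigrams)
print(sample_trigrams)
```

Each n-gram is a tuple of tokens, which is why the code further down has to join the tuple back into a string before searching for it in the corpus.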

This section finds the most common n-grams, from bigrams up to five-grams.

#if you change this to zero you will only get the user chosen ngrams
n_most_common=1 #how many of the most common n-grams do you want.

fdist_bigrams = nltk.FreqDist(bigrams).most_common(n_most_common) #n most common bigrams
fdist_trigrams = nltk.FreqDist(trigrams).most_common(n_most_common) #n most common trigrams
fdist_fourgrams = nltk.FreqDist(fourgrams).most_common(n_most_common) #n most common four grams
fdist_fivegrams = nltk.FreqDist(fivegrams).most_common(n_most_common) #n most common five grams

#concat the tokens of each ngram into one space-separated string
fdist_bigrams=[' '.join(gram) for gram, count in fdist_bigrams]
fdist_trigrams=[' '.join(gram) for gram, count in fdist_trigrams]
fdist_fourgrams=[' '.join(gram) for gram, count in fdist_fourgrams]
fdist_fivegrams=[' '.join(gram) for gram, count in fdist_fivegrams]

#next 4 lines create a single list with important ngrams
n_grams=list(fdist_bigrams) #copy, so extending n_grams leaves fdist_bigrams unchanged
n_grams.extend(fdist_trigrams)
n_grams.extend(fdist_fourgrams)
n_grams.extend(fdist_fivegrams)
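
One thing to keep in mind with n_most_common: nltk.FreqDist is a subclass of collections.Counter, so most_common(1) always returns one entry, even when several n-grams are tied or every n-gram occurs only once (likely for five-grams in a small corpus). Here is a sketch using Counter directly that keeps only n-grams that actually repeat. The min_count threshold and the sample sentence are my own additions, not part of the code above.

```python
from collections import Counter

sample_tokens = "every week is a lot . every week i dread it".split()
bigram_counts = Counter(zip(sample_tokens, sample_tokens[1:]))

# most_common(1) returns exactly one entry, whatever its count is
top = bigram_counts.most_common(1)

# hypothetical filter: keep only bigrams seen at least min_count times
min_count = 2
repeated = [" ".join(gram) for gram, count in bigram_counts.items()
            if count >= min_count]

print(top)
print(repeated)
```

Filtering by count instead of rank avoids joining an n-gram that only happens to be first in the list.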

This section lets you add your own n-grams to the list.

#Another option here would be to make your own list of the ones you want
#in this example I add some user ngrams to the ones found above
user_grams=['ngram1 I like', 'ngram 2', 'another ngram I like a lot']
user_grams=[x.lower() for x in user_grams]    

n_grams.extend(user_grams)
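
A hand-written list works for fixed phrases, but numeric phrases like '20-30 minutes' from the question vary too much to list one by one. For those, a regular expression may be simpler. This is only a sketch with the standard re module: the pattern is my own assumption about what should stay together (a number or number range followed by minutes or hours), and you would need to extend it for other units.

```python
import re

# assumption: only "<number or range> minute(s)/hour(s)" phrases are kept whole;
# the alternatives are tried left to right, so the phrase pattern must come first
pattern = r"\d+(?:-\d+)?\s+(?:minutes?|hours?)\b|\w+|[^\w\s]"

text = "Yes 20-30 minutes a day on my bike, it works great!!".lower()
tokens = re.findall(pattern, text)

print(tokens)
```

Because the pattern has no capturing groups, it should also be usable with nltk.tokenize.RegexpTokenizer if you prefer to stay inside NLTK.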

And this last part performs the processing so that you can tokenize again and get the ngrams as tokens.

#initialize the corpus that will have combined ngrams
corpus_ngrams=corpus

#here we go through the ngrams we found and replace them in the corpus with
#a version connected with dashes. That way we can find them when we tokenize.
for gram in n_grams:
    gram_r=gram.replace(' ','-')
    corpus_ngrams=corpus_ngrams.replace(gram, gram_r)

#retokenize the new corpus so we can find the ngrams
corpus_ngrams_tokens= nltk.word_tokenize(corpus_ngrams)

print(corpus_ngrams_tokens)

Out: ['a-big-tantrum', 'runs-in-my-family', '4x', 'a', 'day', ',', 'every-week', '.', 'a-big-tantrum', 'is', 'lame', '.', 'a-big-tantrum', 'causes', 'strife', '.', 'it', 'runs-in-my-family', 'because', 'of', 'our', 'complicated', 'history', '.', 'every-week', 'is', 'a', 'lot', 'though', '.', 'every-week', 'i', 'dread', 'the', 'tantrum', '.', 'every-week', '...', 'here', 'is', 'another-ngram-i-like-a-lot']
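
Note that the order of n_grams matters in the replacement loop: once a short n-gram has been joined with dashes, a longer n-gram that contains it no longer matches the text. If you want longer phrases to win, one option is to replace them first. A sketch with made-up phrases, not the ones found above:

```python
text = "it runs in my family every week"
phrases = ["my family", "runs in my family", "every week"]

# replace longer phrases first so shorter ones cannot break them up
joined = text
for phrase in sorted(phrases, key=len, reverse=True):
    joined = joined.replace(phrase, phrase.replace(" ", "-"))

joined_tokens = joined.split()
print(joined_tokens)
```

Also be aware that str.replace matches anywhere in the string, including inside longer words, so very short n-grams can hit places you did not intend.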

I think this is actually a very good question.
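
As an alternative to rewriting the corpus text, NLTK also ships nltk.tokenize.MWETokenizer, which merges listed multi-word expressions in a list that is already tokenized. A minimal sketch, assuming plain whitespace splitting instead of word_tokenize so that no model download is needed. The phrases and the sentence are just examples.

```python
from nltk.tokenize import MWETokenizer

phrases = ["runs in my family", "4x a day"]

# each multi-word expression is given as a tuple of tokens
mwe_tokenizer = MWETokenizer([tuple(p.split()) for p in phrases], separator="-")

sentence_tokens = "it runs in my family 4x a day".split()
merged = mwe_tokenizer.tokenize(sentence_tokens)

print(merged)
```

In practice you would pass the output of nltk.word_tokenize to mwe_tokenizer.tokenize, and you could fill phrases from the n_grams list built above.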

