
nlp - Extract most important keywords from a set of documents

I have a set of 3000 text documents and I want to extract the top 300 keywords (which could be single words or multi-word phrases).

I have tried the approaches below:

RAKE: It is a Python-based keyword extraction library, and it failed miserably.

Tf-Idf: It has given me good keywords per document, but it cannot aggregate them to find keywords that represent the whole group of documents. Also, just selecting the top k words from each document based on Tf-Idf score won't help, right?

Word2vec: I was able to do some cool stuff like finding similar words, but I'm not sure how to find important keywords with it.

Can you please suggest a good approach (or elaborate on how to improve any of the above three) to solve this problem? Thanks :)


1 Reply


Although Latent Dirichlet Allocation (LDA) and the Hierarchical Dirichlet Process (HDP) are typically used to derive topics within a text corpus, with those topics then used to classify individual entries, they can also be used to derive keywords for the entire corpus. This approach has the benefit of not relying on another text corpus. A basic workflow would be to compare these Dirichlet keywords to the most common words in the corpus, to see whether LDA or HDP picks up on important words that are not among the most common ones.

Before using the following code, it is generally suggested that the following text preprocessing is done:

  1. Remove punctuation from the texts (see string.punctuation)
  2. Convert the texts to lowercase and split them into "tokens" (individual words), e.g. text.lower().split()
  3. Remove numbers and stop words (see stopwordsiso or stop_words)
  4. Create bigrams - combinations of words that often appear together in the texts (see gensim.models.Phrases)
  5. Lemmatize tokens - convert words to their base forms (see spaCy or NLTK)
  6. Remove tokens that aren't frequent enough (or that are too frequent, but in this case skip removing the too-frequent ones, as these would make good keywords)

These steps would create the variable corpus used in the following (a rough sketch of the preprocessing is given below). A good overview of all of this, with an explanation of LDA, can be found here.
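As one rough sketch of such preprocessing, assuming spaCy (with its small English model en_core_web_sm), gensim's Phrases, and the stopwordsiso package are available - the function name and the bigram parameter values here are illustrative, not part of the original answer:

import string

import spacy
from gensim.models.phrases import Phrases, Phraser
from stopwordsiso import stopwords

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stop_words = stopwords("en")

def preprocess(texts):
    # Tokenize, clean, and add bigrams to the raw documents (illustrative sketch)
    tokenized = []
    for text in texts:
        text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
        doc = nlp(text.lower())
        tokens = [t.lemma_ for t in doc
                  if t.lemma_ not in stop_words and not t.like_num and t.lemma_.strip()]
        tokenized.append(tokens)

    bigram = Phraser(Phrases(tokenized, min_count=5, threshold=10))  # frequent word pairs
    return [bigram[doc] for doc in tokenized]

# texts = [...]  # the 3000 raw documents
# corpus = preprocess(texts)

Frequency-based filtering (step 6) can alternatively be applied later on the dictionary via gensim's Dictionary.filter_extremes.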

Now for LDA and HDP with gensim:

from gensim.models import LdaModel, HdpModel
from gensim import corpora

First create a Dirichlet dictionary that maps the words in corpus to indexes, and then use it to create a bag-of-words representation in which the tokens within corpus are replaced by their indexes. This is done via:

dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
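As a quick illustration of what one bow_corpus entry looks like (the tokens here are hypothetical):

example_doc = ["model", "data", "model"]          # hypothetical tokens
example_dict = corpora.Dictionary([example_doc])
print(example_dict.doc2bow(example_doc))          # e.g. [(0, 1), (1, 2)] - (token id, count) pairs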

For LDA, the optimal number of topics needs to be derived, which can be done heuristically through the method in this answer. Let's assume that our optimal number of topics is 10, and, as per the question, that we want 300 keywords:

num_topics = 10
num_keywords = 300
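One hedged sketch of such a heuristic, using gensim's CoherenceModel to compare c_v coherence over a few candidate topic counts (the candidate range and pass count here are assumptions):

from gensim.models import CoherenceModel

coherence_per_k = {}
for k in range(5, 25, 5):
    candidate_model = LdaModel(corpus=bow_corpus, id2word=dirichlet_dict,
                               num_topics=k, passes=10, alpha='auto')
    cm = CoherenceModel(model=candidate_model, texts=corpus,
                        dictionary=dirichlet_dict, coherence='c_v')
    coherence_per_k[k] = cm.get_coherence()

# num_topics = max(coherence_per_k, key=coherence_per_k.get)  # pick the most coherent count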

Create an LDA model:

dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')

Next comes a function to derive the best topics based on their average coherence across the corpus. First an ordered list of the most important words per topic is produced; then the average coherence of each topic across the whole corpus is found; and finally the topics are ordered by this average coherence and returned along with a list of the averages for later use. The code for all of this is as follows (it includes the option to use the HDP model created below):

def order_subset_by_coherence(dirichlet_model, bow_corpus, num_topics=10, num_keywords=10):
    """
    Orders topics based on their average coherence across the corpus

    Parameters
    ----------
        dirichlet_model : gensim.models.type_of_model
        bow_corpus : list of lists (contains (id, freq) tuples)
        num_topics : int (default=10)
        num_keywords : int (default=10)

    Returns
    -------
        ordered_topics, ordered_topic_averages: list of lists and list
    """
    if isinstance(dirichlet_model, LdaModel):
        shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
                                                   num_words=num_keywords,
                                                   formatted=False)
    elif isinstance(dirichlet_model, HdpModel):
        shown_topics = dirichlet_model.show_topics(num_topics=150,  # return all topics
                                                   num_words=num_keywords,
                                                   formatted=False)
    model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
    topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0 

    topics_per_response = [response for response in topic_corpus]
    flat_topic_coherences = [item for sublist in topics_per_response for item in sublist]

    significant_topics = list(set([t_c[0] for t_c in flat_topic_coherences])) # those that appear
    topic_averages = [sum([t_c[1] for t_c in flat_topic_coherences if t_c[0] == topic_num]) / len(bow_corpus) 
                      for topic_num in significant_topics]

    topic_indexes_by_avg_coherence = [tup[0] for tup in sorted(enumerate(topic_averages), key=lambda i:i[1])[::-1]]

    significant_topics_by_avg_coherence = [significant_topics[i] for i in topic_indexes_by_avg_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_avg_coherence][:num_topics] # limit for HDP

    ordered_topic_averages = [topic_averages[i] for i in topic_indexes_by_avg_coherence][:num_topics] # limit for HDP
    ordered_topic_averages = [a/sum(ordered_topic_averages) for a in ordered_topic_averages] # normalize HDP values

    return ordered_topics, ordered_topic_averages

Now to get a list of keywords - the most important words across the topics. This is done by subsetting the words (which again are ordered by significance by default) from each of the ordered topics, in proportion to each topic's average coherence to the whole corpus. To explain explicitly, assume that there are just two topics, and that the texts are 70% coherent to the first and 30% coherent to the second. The keywords could then be the top 70% of num_keywords words from the first topic, and the top 30% from the second that have not already been selected. This is achieved via the following:

ordered_topics, ordered_topic_averages = \
    order_subset_by_coherence(dirichlet_model=dirichlet_model,
                              bow_corpus=bow_corpus, 
                              num_topics=num_topics,
                              num_keywords=num_keywords)

ignore_words = []  # words that should not appear in the results (fill in as needed)

keywords = []
for i in range(num_topics):
    # Find the number of indexes to select, which can later be extended if a word has already been selected
    selection_indexes = list(range(int(round(num_keywords * ordered_topic_averages[i]))))
    if selection_indexes == [] and len(keywords) < num_keywords:
        # Fix a potential rounding error by giving this topic one selection
        selection_indexes = [0]

    for s_i in selection_indexes:
        if s_i >= len(ordered_topics[i]):
            break  # no more words available in this topic
        if ordered_topics[i][s_i] not in keywords and ordered_topics[i][s_i] not in ignore_words:
            keywords.append(ordered_topics[i][s_i])
        else:
            # Word already selected or ignored: extend the selection by one index
            selection_indexes.append(selection_indexes[-1] + 1)

# Fix for if too many were selected
keywords = keywords[:num_keywords]

The above also includes the variable ignore_words, which is a list of words that should not be included in the results.

For HDP the process is similar to the above, except that num_topics and other arguments do not need to be passed during model creation. HDP derives the optimal topics itself, but these topics then need to be ordered and subsetted using order_subset_by_coherence to ensure that the best topics are used for a finite selection. A model is created via:

dirichlet_model = HdpModel(corpus=bow_corpus, 
                           id2word=dirichlet_dict,
                           chunksize=len(bow_corpus))

It is best to test both LDA and HDP: LDA can outperform HDP if a suitable number of topics can be found for the given problem (and it is still the more standard choice). Compare the Dirichlet keywords to word frequencies alone; with luck, the result is a list of keywords that are more related to the overall theme of the texts, rather than simply the words that are most common.
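For that comparison, a small sketch of the frequency baseline using collections.Counter (the variable names here are illustrative):

from collections import Counter

token_counts = Counter(token for doc in corpus for token in doc)
frequent_words = [w for w, _ in token_counts.most_common(num_keywords)]

overlap = set(keywords) & set(frequent_words)
print(f"{len(overlap)} of {num_keywords} Dirichlet keywords are also among the most frequent words")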

Obviously selecting ordered words from topics based on percent text coherence doesn’t give an overall ordering of the keywords by importance, as some words that are very important in topics with less overall coherence will be selected later.

The process for using LDA to generate keywords for the individual texts within the corpus can be found in this answer.

