I am practicing a classification task in NLP. I am using the 20 Newsgroups dataset and want to implement a classification model. Before training the model, I want to implement:
- stopword removal
- punctuation removal
- converting to lower case, since this is not a sentiment analysis task, so case distinctions should not matter here.
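For reference, the three steps above can be sketched as a single text-cleaning function. This is a minimal sketch: in practice the stopword set would come from NLTK (`set(stopwords.words('english'))`), but a tiny hardcoded set is used here so the snippet runs standalone.

```python
import string

# In practice: from nltk.corpus import stopwords; STOPWORDS = set(stopwords.words('english'))
# A tiny hardcoded set keeps this sketch self-contained.
STOPWORDS = {'the', 'a', 'an', 'is', 'and', 'to', 'of', 'it'}

# Translation table that deletes all ASCII punctuation characters.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess(text: str) -> str:
    # lower-case, strip punctuation, then drop stopwords
    text = text.lower().translate(PUNCT_TABLE)
    return ' '.join(w for w in text.split() if w not in STOPWORDS)
```

Building the stopword set and translation table once at module level, rather than inside a loop, is what keeps this approach fast.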
I am using the following code:
max_len = 0
for sent in x_train:
    tokenizer_out = tokenizer(sent)
    # convert numerical tokens to alphabetical tokens
    encoded_tok = tokenizer.convert_ids_to_tokens(tokenizer_out.input_ids)
    tokens_without_sw = [word for word in encoded_tok if word not in stopwords.words()]
    new_ids = tokenizer.convert_tokens_to_ids(tokens_without_sw)
    max_len = max(max_len, len(new_ids))
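A likely bottleneck in the loop above is that `stopwords.words()` rebuilds the whole stopword list for every word tested in the comprehension. A faster sketch, under the assumption that filtering can happen on the raw text before tokenization (the stopword set is hardcoded to a tiny set here so the snippet runs standalone; in practice it would be `set(stopwords.words('english'))`):

```python
# In practice: from nltk.corpus import stopwords; STOP_WORDS = set(stopwords.words('english'))
STOP_WORDS = {'the', 'a', 'an', 'is', 'and', 'to', 'of', 'i'}

def strip_stopwords(sent: str) -> str:
    # Filter words *before* tokenization: BERT's WordPiece tokenizer splits
    # words into sub-pieces ('##ing', etc.) that never match a word-level
    # stopword list, so filtering token strings after encoding is both
    # slower and less reliable.
    return ' '.join(w for w in sent.split() if w.lower() not in STOP_WORDS)

# Then each sentence is tokenized only once:
# max_len = max(len(tokenizer(strip_stopwords(s)).input_ids) for s in x_train)
```

Using a `set` makes each membership test O(1), and moving the filtering before tokenization avoids the ids-to-tokens-to-ids round trip entirely.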
I will be using pretrained BERT from hugging face. And before implementing the code above, I had done the following to remove unnecessary lines:
def clean(post: str, remove_it: tuple):
    new_lines = []
    for line in post.splitlines():
        if not line.startswith(remove_it):
            new_lines.append(line)
    return '\n'.join(new_lines)
remove_it = (
    'From:',
    'Subject:',
    'Reply-To:',
    'In-Reply-To:',
    'Nntp-Posting-Host:',
    'Organization:',
    'X-Mailer:',
    'In article <',
    'Lines:',
    'NNTP-Posting-Host:',
    'Summary:',
    'Article-I.D.:'
)
x_train = [clean(p, remove_it) for p in x_train]
x_test = [clean(p, remove_it) for p in x_test]
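To illustrate what the header-stripping step does, here is the same helper applied to a made-up post (the post text and the abbreviated prefix tuple are invented for the example; the helper is repeated so the snippet runs standalone):

```python
def clean(post: str, remove_it: tuple) -> str:
    # same helper as above, repeated so this example is self-contained
    return '\n'.join(
        line for line in post.splitlines() if not line.startswith(remove_it)
    )

remove_it = ('From:', 'Subject:', 'Lines:')  # abbreviated for the example

post = (
    "From: someone@example.com\n"
    "Subject: re: baseball\n"
    "Lines: 2\n"
    "I think the Braves will win.\n"
    "What do you think?"
)

print(clean(post, remove_it))
# only the two body lines survive
```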
My next goal is to clean the data further. With my classification model, I am able to achieve 90% accuracy, but I want to increase it further. So, I want to remove the stopwords and punctuation, convert to lower case, and see what happens. But with the code I use, it's taking forever to run, so I want a faster approach.
Can anyone help me?
question from:
https://stackoverflow.com/questions/65651681/how-to-clean-the-20newsgroup-dataset-for-nlp-tasks