Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to clean the 20newsgroup dataset for nlp tasks

I am practicing text classification on the 20 Newsgroups dataset and want to build a classification model. Before training the model, I want to apply the following preprocessing:

  1. stopword removal
  2. punctuation removal
  3. converting to lower case (this is not a sentiment analysis task, so I don't think case distinctions matter here).

I am using the following code:

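# assuming NLTK's stopword corpus, which matches the stopwords.words() call below
from nltk.corpus import stopwords

max_len = 0
for sent in x_train:
    tokenizer_out = tokenizer(sent)
    # convert the token ids back to token strings
    encoded_tok = tokenizer.convert_ids_to_tokens(tokenizer_out.input_ids)
    # drop stopwords (note: stopwords.words() is called once per token here)
    tokens_without_sw = [word for word in encoded_tok if word not in stopwords.words()]
    new_ids = tokenizer.convert_tokens_to_ids(tokens_without_sw)
    # track the longest sequence left after stopword removal
    max_len = max(max_len, len(new_ids))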

I will be using a pretrained BERT model from Hugging Face. Before running the code above, I removed the unnecessary header lines as follows:

def clean(post: str, remove_it: tuple):
    new_lines = []
    for line in post.splitlines():
        if not line.startswith(remove_it):
            new_lines.append(line)
    return '\n'.join(new_lines)

remove_it = (
    'From:',
    'Subject:',
    'Reply-To:',
    'In-Reply-To:',
    'Nntp-Posting-Host:',
    'Organization:',
    'X-Mailer:',
    'In article <',
    'Lines:',
    'NNTP-Posting-Host:',
    'Summary:',
    'Article-I.D.:'
)
x_train = [clean(p, remove_it) for p in x_train]
x_test = [clean(p, remove_it) for p in x_test]

My next goal is to clean the data further. My classifier already reaches 90% accuracy, but I want to improve on that, so I want to remove stopwords and punctuation, convert everything to lower case, and see what happens. However, the stopword-removal loop above takes forever to run, so I am looking for a faster approach.

Can anyone help me?

question from:https://stackoverflow.com/questions/65651681/how-to-clean-the-20newsgroup-dataset-for-nlp-tasks
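
Not a definitive answer, but here is a minimal sketch of one faster way to do the three steps from the question (lower-casing, punctuation removal, stopword removal). The likely bottleneck in the loop above is that stopwords.words() is evaluated again for every token, and with no argument it returns a list covering every language NLTK ships, so each membership check is a long linear scan. Building the stopword collection once as a set avoids both costs. The names STOP_WORDS and preprocess are hypothetical, and the tiny stopword set is only a stand-in so the snippet runs without downloads; in practice you would probably swap in set(stopwords.words('english')). Note that this sketch cleans the raw text before tokenization instead of filtering BERT word pieces afterwards, which is a different design from the original loop.

```python
import string

# Illustrative stand-in for a real stopword list (for example NLTK's English
# list). A frozenset gives constant-time membership checks.
STOP_WORDS = frozenset({
    "a", "an", "the", "is", "are", "of", "to",
    "and", "in", "it", "this", "that",
})

# Translation table that deletes every ASCII punctuation character.
# It is built once and reused for every document.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and drop stopwords from one document."""
    text = text.lower().translate(_PUNCT_TABLE)
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


sample = "This is the FIRST post, and it is about Space!"
cleaned = preprocess(sample)
print(cleaned)  # first post about space
```

With this in place, the cleaning step would look like x_train = [preprocess(p) for p in x_train], followed by the usual tokenizer call on the cleaned strings. Whether any of this improves accuracy with a pretrained BERT model is something to measure, not assume, since such models are trained on text that includes stopwords and punctuation.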
