I am practicing a classification task in NLP. I am using the 20 Newsgroups dataset and want to implement a classification model. Before training the model, I want to implement:
- stopword removal
- punctuation removal
- converting to lower case, since this is not a sentiment analysis task, so case distinctions should not matter here.
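For reference, the three steps above can be sketched as a single text-cleaning function. This is a minimal sketch: in practice the stopword set would come from NLTK (`set(stopwords.words('english'))`), but a tiny hardcoded set is used here so the snippet runs standalone.

```python
import string

# In practice: from nltk.corpus import stopwords; STOPWORDS = set(stopwords.words('english'))
# A tiny hardcoded set keeps this sketch self-contained.
STOPWORDS = {'the', 'a', 'an', 'is', 'and', 'to', 'of', 'it'}

# Translation table that deletes all ASCII punctuation characters.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess(text: str) -> str:
    # lower-case, strip punctuation, then drop stopwords
    text = text.lower().translate(PUNCT_TABLE)
    return ' '.join(w for w in text.split() if w not in STOPWORDS)
```

Building the stopword set and translation table once at module level, rather than inside a loop, is what keeps this approach fast.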
I am using the following code:
max_len = 0
for sent in x_train:
    tokenizer_out = tokenizer(sent)
    # convert numerical tokens to alphabetical tokens
    encoded_tok = tokenizer.convert_ids_to_tokens(tokenizer_out.input_ids)
    tokens_without_sw = [word for word in encoded_tok if word not in stopwords.words()]
    new_ids = tokenizer.convert_tokens_to_ids(tokens_without_sw)
    max_len = max(max_len, len(new_ids))
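A likely bottleneck in the loop above is that `stopwords.words()` rebuilds the whole stopword list for every word tested in the comprehension. A faster sketch, under the assumption that filtering can happen on the raw text before tokenization (the stopword set is hardcoded to a tiny set here so the snippet runs standalone; in practice it would be `set(stopwords.words('english'))`):

```python
# In practice: from nltk.corpus import stopwords; STOP_WORDS = set(stopwords.words('english'))
STOP_WORDS = {'the', 'a', 'an', 'is', 'and', 'to', 'of', 'i'}

def strip_stopwords(sent: str) -> str:
    # Filter words *before* tokenization: BERT's WordPiece tokenizer splits
    # words into sub-pieces ('##ing', etc.) that never match a word-level
    # stopword list, so filtering token strings after encoding is both
    # slower and less reliable.
    return ' '.join(w for w in sent.split() if w.lower() not in STOP_WORDS)

# Then each sentence is tokenized only once:
# max_len = max(len(tokenizer(strip_stopwords(s)).input_ids) for s in x_train)
```

Using a `set` makes each membership test O(1), and moving the filtering before tokenization avoids the ids-to-tokens-to-ids round trip entirely.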
I will be using pretrained BERT from hugging face. And before implementing the code above, I had done the following to remove unnecessary lines:
def clean(post: str, remove_it: tuple):
    new_lines = []
    for line in post.splitlines():
        if not line.startswith(remove_it):
            new_lines.append(line)
    return '\n'.join(new_lines)
remove_it = (
    'From:',
    'Subject:',
    'Reply-To:',
    'In-Reply-To:',
    'Nntp-Posting-Host:',
    'Organization:',
    'X-Mailer:',
    'In article <',
    'Lines:',
    'NNTP-Posting-Host:',
    'Summary:',
    'Article-I.D.:'
)
x_train = [clean(p, remove_it) for p in x_train]
x_test = [clean(p, remove_it) for p in x_test]
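To illustrate what the header-stripping step does, here is the same helper applied to a made-up post (the post text and the abbreviated prefix tuple are invented for the example; the helper is repeated so the snippet runs standalone):

```python
def clean(post: str, remove_it: tuple) -> str:
    # same helper as above, repeated so this example is self-contained
    return '\n'.join(
        line for line in post.splitlines() if not line.startswith(remove_it)
    )

remove_it = ('From:', 'Subject:', 'Lines:')  # abbreviated for the example

post = (
    "From: someone@example.com\n"
    "Subject: re: baseball\n"
    "Lines: 2\n"
    "I think the Braves will win.\n"
    "What do you think?"
)

print(clean(post, remove_it))
# only the two body lines survive
```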
My next goal is to clean the data further. With my classification model, I am able to achieve 90% accuracy, but I want to increase it further. So, I want to remove the stopwords and punctuation, convert to lower case, and see what happens. But with the code I use, it's taking forever to run, so I want a faster approach.
Can anyone help me?
question from:
https://stackoverflow.com/questions/65651681/how-to-clean-the-20newsgroup-dataset-for-nlp-tasks