python - Remove unicode encoded emojis from Twitter tweet

Question

posted Feb 19, 2021 in Technique[技术] by 深蓝 (71.8m points)

For a data science project I am tasked with cleaning up our Twitter data. The tweets contain Unicode-encoded emojis (and other stuff) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example.

I am using the Python package "re", and so far I have been successful in removing "simple" Unicode escapes like \u201c (double quotation mark) with something like

text = re.sub(u'\u201c', '', text)

However, when I try to remove more complex structures, for example

text = re.sub(u'\ud83d\udcf8', '', text) # remove camera emoji
text = re.sub(u'\ud83c\uddeb\ud83c\uddf7', '', text) # remove French flag emoji

nothing happens, whether I prefix the string with a 'u', an 'r', or nothing at all. The Unicode sequence remains in the string.

EDIT: Thanks to @Shawn Shroyer's answer I found out that

text = re.sub(u'\\ud83d\\udcf8', '', text)

works fine! I just had to escape the backslashes. Now only my second problem remains (see below).

The second problem is that I don't want to specify every single emoji individually. Instead I would like to remove them all in a much simpler fashion, but without removing ALL Unicode characters, because I need to retain characters like \u2019 (right single quotation mark).

1 Reply

深蓝 · Answer 1 · 2021-02-19T04:10:48+0000

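My suggestion would be to create a list of the values you would like to replace. You need to escape each \ by adding another backslash, or put 'ur' before your string so that backslashes do not need to be escaped (the 'ur' prefix exists only in Python 2; in Python 3 use a plain 'r' prefix).

import re
# Each backslash is doubled so that the regex engine receives \ud83d\udcf8 and so on.
to_remove_arr = [u"\\ud83d\\udcf8", u"\\ud83c\\uddeb\\ud83c\\uddf7"]
# Join the alternatives into one pattern: a|b
pattern_str = "|".join(to_remove_arr)
text = re.sub(pattern_str, "", text)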
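The same idea can be sketched for Python 3 under one loud assumption: the tweets hold the escape sequences as literal text (a backslash followed by `ud83d...`), not as decoded characters. `re.escape` is used so each backslash is matched as a literal character without doubling it by hand. The names `TO_REMOVE`, `remove_sequences`, and the sample tweet are hypothetical.

```python
import re

# Literal escape sequences as they are assumed to appear in the raw tweet text.
TO_REMOVE = [r"\ud83d\udcf8", r"\ud83c\uddeb\ud83c\uddf7"]

# re.escape makes every backslash match a literal backslash in the input.
PATTERN = re.compile("|".join(re.escape(seq) for seq in TO_REMOVE))

def remove_sequences(text):
    """Delete every listed escape sequence from text."""
    return PATTERN.sub("", text)

# Hypothetical sample: raw string, so the backslashes are real characters.
sample = r"Paris \ud83d\udcf8 \ud83c\uddeb\ud83c\uddf7 it\u2019s nice"
cleaned = remove_sequences(sample)
```

Sequences that are not in the list, such as the literal `\u2019` in the sample, are left untouched.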

Edit: the above solution removes specific Unicode characters. To remove all non-ASCII characters instead:

text = text.encode("ascii", "ignore").decode()
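A quick check of what this one-liner does, assuming Python 3 and text that already holds decoded characters (the sample string is made up). It shows that the approach also drops characters the question wants to keep, such as U+2019.

```python
# Hypothetical sample with a right single quotation mark and a camera emoji.
text = "It\u2019s sunny \U0001F4F8"

# Characters that cannot be encoded as ASCII are silently dropped.
ascii_only = text.encode("ascii", "ignore").decode()
```

Both the apostrophe and the emoji disappear here, so this is only suitable when no non-ASCII character needs to survive.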

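Edit: to remove only emojis, I found:

import re

def strip_emoji(text):
    # Ranges: symbols and dingbats, pictographs and emoticons, transport and map symbols
    RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    return RE_EMOJI.sub(r'', text)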
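The ranges above do not include the regional indicator symbols (U+1F1E6 to U+1F1FF) that make up flag emojis such as the French flag from the question. Below is a minimal sketch that adds that range, assuming Python 3 and text that already holds decoded characters. The name `strip_emoji_sketch` and the sample are hypothetical, and the range list is illustrative, not a complete inventory of emoji code points.

```python
import re

# Code point ranges chosen for illustration; not an exhaustive emoji list.
EMOJI_PATTERN = re.compile(
    "["
    "\U00002600-\U000027BF"  # Miscellaneous Symbols and Dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flag pairs)
    "\U0001F300-\U0001F64F"  # pictographs and emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "]+"
)

def strip_emoji_sketch(text):
    """Remove characters in the ranges above and keep everything else."""
    return EMOJI_PATTERN.sub("", text)

# French flag (U+1F1EB U+1F1F7), camera (U+1F4F8), and U+2019, which must survive.
sample = "Vive la France \U0001F1EB\U0001F1F7 \U0001F4F8 it\u2019s great"
result = strip_emoji_sketch(sample)
```

If the tweets still contain literal escape text such as `\ud83d\udcf8`, they would need to be decoded into real characters first (for example by parsing the original JSON with the `json` module) before a code point pattern like this can match.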
