Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

in Technique[技术] by (71.8m points)

python - Remove unicode encoded emojis from Twitter tweet

For a data science project I am tasked with cleaning up our Twitter data. The tweets contain Unicode-encoded emojis (and other stuff) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example.

I am using the Python package "re", and so far I have been able to remove "simple" escapes like \u201c (left double quotation mark) with something like

text = re.sub(u'\u201c', '', text)

However, when I try to remove more complex sequences, for example

text = re.sub(u'\ud83d\udcf8', '', text) # remove camera emoji
text = re.sub(u'\ud83c\uddeb\ud83c\uddf7', '', text) # remove French flag emoji

nothing happens, no matter whether I prefix the string with 'u', with 'r', or with nothing at all. The escape sequences remain in the string.

EDIT: Thanks to @Shawn Shroyer's answer I found out that

text = re.sub(u'\\ud83d\\udcf8', '', text)

works fine! I just had to escape the backslashes. Now only my second problem remains (see below).

The second problem is that I don't want to specify every single emoji individually. I would like to remove them all in a much simpler way, but without removing ALL Unicode characters, because I need to keep things like \u2019 (right single quotation mark).



1 Reply

0 votes
by (71.8m points)

My suggestion would be to create an array of the values you would like to replace. You need to escape each backslash by adding another backslash, or prefix the string with 'ur' (Python 2 only) so that backslashes do not need to be escaped.

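import re

# Raw strings: each "\\" is a regex-escaped backslash, so the patterns match the
# literal text \ud83d\udcf8 etc. (assuming the tweets store the escapes as plain text).
to_remove_arr = [r"\\ud83d\\udcf8", r"\\ud83c\\uddeb\\ud83c\\uddf7"]
pattern_str = "|".join(to_remove_arr)
text = re.sub(pattern_str, "", text)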
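As a rough, self-contained illustration of the list-based approach in Python 3: the sketch below assumes the tweets hold the escapes as literal text (a backslash, "u", four hex digits) and lets re.escape handle the backslash escaping. The sample string and the names to_remove, pattern and cleaned are made up for the example.

```python
import re

# Assumption: escapes are stored as literal text, not as real Unicode characters.
sample = r"Paris \ud83d\udcf8 \ud83c\uddeb\ud83c\uddf7 \u201cbonjour\u201d"

# Sequences to delete, written exactly as they appear in the data.
to_remove = [r"\ud83d\udcf8", r"\ud83c\uddeb\ud83c\uddf7"]

# re.escape turns each backslash into a regex-literal backslash.
pattern = re.compile("|".join(re.escape(seq) for seq in to_remove))

cleaned = pattern.sub("", sample)
print(cleaned)
```

Only the listed sequences are removed; other escapes such as \u201c are left untouched.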

Edit: the above solution removes specific Unicode characters. To remove all non-ASCII characters instead (note that this also drops characters such as \u2019):

text = text.encode("ascii", "ignore").decode()
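A small sketch of what the ASCII round trip does, assuming the string already contains real Unicode characters rather than literal escape text. The sample sentence is invented for the example.

```python
# Sample with a right single quotation mark (U+2019) and a camera emoji (U+1F4F8).
text = "It\u2019s a photo \U0001F4F8"

# Every character that cannot be encoded as ASCII is silently dropped.
ascii_only = text.encode("ascii", "ignore").decode("ascii")
print(repr(ascii_only))
```

The emoji is gone, but so is the apostrophe, which is why this is too aggressive when characters like \u2019 must be kept.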

Edit: to remove only emojis I found:

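import re

# Ranges: Miscellaneous Symbols and Dingbats, pictographs and emoticons, transport and map symbols.
RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')

def strip_emoji(text):
    return RE_EMOJI.sub(r'', text)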
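The ranges above do not include the regional indicator symbols (U+1F1E6 to U+1F1FF) that make up flag emojis such as the French flag from the question, and they only match real characters, not literal escape text. A possible extension for Python 3 is sketched below. It assumes the tweets store the escapes as literal text with well-formed surrogate pairs (an unpaired surrogate would raise UnicodeDecodeError), and the names decode_escapes and strip_emoji_extended are made up for the example.

```python
import re

# Assumption: escapes appear as literal text (backslash, "u", four hex digits).
ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")

# Same ranges as strip_emoji, plus regional indicator symbols (used for flags).
EMOJI_RE = re.compile(
    "[\u2600-\u27BF"
    "\U0001F1E6-\U0001F1FF"
    "\U0001F300-\U0001F64F"
    "\U0001F680-\U0001F6FF]"
)


def decode_escapes(text):
    """Turn literal \\uXXXX sequences into real characters, joining surrogate pairs."""
    with_surrogates = ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Round-tripping through UTF-16 merges each high/low surrogate pair into one code point.
    return with_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")


def strip_emoji_extended(text):
    """Decode literal escapes, then delete characters in the emoji ranges."""
    return EMOJI_RE.sub("", decode_escapes(text))


raw = r"It\u2019s a photo \ud83d\udcf8 from \ud83c\uddeb\ud83c\uddf7"
print(strip_emoji_extended(raw))
```

The right single quotation mark (\u2019) survives because it lies outside all four ranges. The ranges are still not a complete emoji list, so they may need extending for other data.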
