For a data science project I am tasked with cleaning up our Twitter data. The tweets contain Unicode-escaped emojis (and other characters) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example.
I am using Python's "re" module, and so far I have been able to remove "simple" Unicode escapes like \u201c (double quotation mark) with something like
text = re.sub(u'\u201c', '', text)
However, when I try to remove more complex sequences, for example
text = re.sub(u'\ud83d\udcf8', '', text) # remove camera emoji
text = re.sub(u'\ud83c\uddeb\ud83c\uddf7', '', text) # remove French flag emoji
nothing happens, no matter whether I prefix the string with 'u', 'r', or nothing at all. The escape sequence remains in the string.
EDIT:
Thanks to @Shawn Shroyer's answer I found out that
text = re.sub(u'\\ud83d\\udcf8', '', text)
works fine! I just had to escape the backslashes. Now only my second problem remains (see below).
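To illustrate why the escaping matters, here is a minimal sketch. It assumes Python 3 and assumes the tweets hold the escape sequence as literal text (a backslash followed by "ud83d..."), not as an actual emoji character. The sample string is made up for illustration, and the pattern is written as a raw string, which is a slightly different spelling from the line above.

```python
import re

# Assumption: the tweet contains the literal characters backslash, 'u', 'd', '8', ...
# and not a real emoji code point. The raw string keeps the backslashes as-is.
text = r"Nice shot \ud83d\udcf8 today"

# In the pattern, each backslash is doubled so that the regex engine
# searches for a literal backslash followed by the hex digits.
cleaned = re.sub(r"\\ud83d\\udcf8", "", text)

print(cleaned)
```

If the tweets contained real emoji characters instead of escape text, this pattern would not match anything.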
The second problem is that I don't want to specify every single emoji individually. Instead, I would like to remove them all in one go, but without removing ALL Unicode characters, because I need to keep things like \u2019 (single quotation mark).
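One direction that might work is sketched below, under the same assumptions as before: Python 3, and escapes stored as literal text. It relies on the observation that the emojis in my examples are written as surrogate pairs (an escape in the range \ud800-\udbff followed by one in \udc00-\udfff), while characters like \u2019 are single escapes. The names SURROGATE_PAIR_ESCAPE and strip_astral_escapes are hypothetical, chosen only for this sketch.

```python
import re

# Hypothetical pattern: a literal high-surrogate escape (\ud800-\udbff)
# immediately followed by a literal low-surrogate escape (\udc00-\udfff).
SURROGATE_PAIR_ESCAPE = re.compile(
    r"\\ud[89ab][0-9a-f]{2}\\ud[c-f][0-9a-f]{2}", re.IGNORECASE
)

def strip_astral_escapes(text):
    """Remove every escaped surrogate pair; single escapes are left alone."""
    return SURROGATE_PAIR_ESCAPE.sub("", text)

# Made-up sample: an apostrophe escape, a camera emoji, and a French flag.
sample = r"It\u2019s here \ud83d\udcf8 \ud83c\uddeb\ud83c\uddf7!"
print(strip_astral_escapes(sample))
```

This is blunt: it drops every character outside the Basic Multilingual Plane, not just emojis, and it leaves emojis that are encoded as a single escape (for example \u2764, the heart) untouched. Those would need an extra range in the pattern.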
Battle the dragon too long and you become the dragon yourself; gaze into the abyss too long and the abyss will gaze back…