Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - How to find and count emoticons in a string using python?

This topic has been addressed for text-based emoticons at link1, link2, link3. However, I would like to do something slightly different from matching simple emoticons. I'm sorting through tweets that contain emoticon icons. The following Unicode information contains just such emoticons: pdf.

Using a string with English words that also contains any of these emoticons from the pdf, I would like to be able to compare the number of emoticons to the number of words.

The direction that I was heading down doesn't seem to be the best option and I was looking for some help. As you can see in the script below, I was just planning to do the work from the command line:

$cat <file containing the strings with emoticons> | ./emo.py

emo.py pseudo script:

import re
import sys

for row in sys.stdin:
    print row.decode('utf-8').encode("ascii","replace")
    #insert regex to find the emoticons
    if match:
       #do some counting using .split(" ")
       #print the counting

The problem that I'm running into is the decoding/encoding. I haven't found a good option for how to encode/decode the string so I can correctly find the icons. An example of the string that I want to search to find the number of words and emoticons is as follows:

"Smiley emoticon rocks![emoticon image] I like you[emoticon image]."

The challenge: can you make a script that counts the number of words and emoticons in this string? Notice that the emoticons are both sitting next to the words with no space in between.


1 Reply

First, there is no need to encode here at all. You've got a Unicode string, and the re engine can handle Unicode, so just use it.

A character class can include a range of characters, by specifying the first and last with a hyphen in between. And you can specify Unicode characters that you don't know how to type with \U escape sequences. So:

import re

s = u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
# The string literal expands the \U escapes itself, so no raw prefix is needed.
count = len(re.findall(u'[\U0001f600-\U0001f650]', s))

Or, if the string is big enough that building up the whole findall list seems wasteful:

emoticons = re.finditer(u'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)

Counting words, you can do separately:

wordcount = len(s.split())

If you want to do it all at once, you can use an alternation group:

# \\w is doubled because the literal is not raw; \U is expanded by the literal.
word_and_emoticon_count = len(re.findall(u'\\w+|[\U0001f600-\U0001f650]', s))
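
As a minimal sketch of the whole task in Python 3 (where every string is already Unicode), the two counts can be kept separate so they can be compared. The function name count_words_and_emoticons and the sample string are illustrative assumptions, and the emoticon range is simply the one used above, not a complete list of emoji.

```python
import re

# Assumed range, taken from the answer above: U+1F600 through U+1F650.
EMOTICON = re.compile('[\U0001f600-\U0001f650]')
# \w+ matches runs of letters, digits and underscores; emoticons are not word characters.
WORD = re.compile(r'\w+')

def count_words_and_emoticons(text):
    """Return (word_count, emoticon_count) for a Unicode string."""
    words = len(WORD.findall(text))
    emoticons = len(EMOTICON.findall(text))
    return words, emoticons

# Hypothetical sample: emoticons sit directly next to words, with no space.
sample = 'Smiley emoticon rocks!\U0001f600 I like you\U0001f601.'
words, emoticons = count_words_and_emoticons(sample)
print(words, emoticons)
```

Because the words are matched with \w+ rather than split on whitespace, an emoticon glued to a word does not get counted as part of that word.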

As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds. And, for example, most CPython Windows builds are narrow. In narrow builds, characters can only be in the range U+0000 to U+FFFF. There's no way to search for these characters, but that's OK, because they don't exist to search for; you can just assume they don't exist if you get an "invalid range" error compiling the regexp.

Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist, they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:

(lead == low_lead and lead != high_lead and low_trail <= trail <= 0xDFFF or
 lead == high_lead and lead != low_lead and 0xDC00 <= trail <= high_trail or
 low_lead < lead < high_lead and 0xDC00 <= trail <= 0xDFFF)

You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.

If it's not obvious how that translates into regexp, here's an example for the range [\U0001e050-\U0001fbbf] in UTF-16-BE:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])

Of course if your range is small enough that low_lead == high_lead this gets simpler. For example, the original question's range can be searched with:

\ud83d[\ude00-\ude50]
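
The conversion from code points to surrogate code units can be sketched as below. This is only an illustration under stated assumptions: the helper names to_surrogates and surrogate_pattern are made up here, both endpoints are assumed to lie above U+FFFF, and the middle alternative uses the strict trail range rather than the looser "." form.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into (lead, trail) UTF-16 code units."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def surrogate_pattern(low, high):
    """Build regex source text matching the range low..high as surrogate pairs."""
    low_lead, low_trail = to_surrogates(low)
    high_lead, high_trail = to_surrogates(high)
    if low_lead == high_lead:
        # Simple case: one lead unit, one contiguous trail range.
        return r'\u%04x[\u%04x-\u%04x]' % (low_lead, low_trail, high_trail)
    # First lead unit: trail runs from low_trail up to the end of the trail block.
    parts = [r'\u%04x[\u%04x-\udfff]' % (low_lead, low_trail)]
    if high_lead - low_lead > 1:
        # Lead units strictly in between accept any valid trail unit.
        parts.append(r'[\u%04x-\u%04x][\udc00-\udfff]'
                     % (low_lead + 1, high_lead - 1))
    # Last lead unit: trail runs from the start of the trail block to high_trail.
    parts.append(r'\u%04x[\udc00-\u%04x]' % (high_lead, high_trail))
    return '|'.join(parts)

print(surrogate_pattern(0x1F600, 0x1F650))
print(surrogate_pattern(0x1E050, 0x1FBBF))
```

The returned text contains \u escapes rather than the characters themselves, so it is meant to be handed to a regex engine that interprets \uXXXX in patterns, or pasted into a Unicode string literal.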

One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)
