Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - replace emoji unicode symbol using regexp in javascript

As you all know, emoji symbols are encoded in 3 or 4 bytes, so a single emoji may occupy 2 characters in my string. For example, '😁wew😁'.length is 7. I want to find those symbols in my text and replace each one with a value that depends on its code. Reading SO, I came across the XRegExp library with its Unicode plugin, but I have not found a way to make it work.

var str = '😁wew😁'; // U+1F601 symbol
var reg = XRegExp('[\u1F601-\u1F64F]', 'g'); // becomes /[ὠ1-ὤF]/g, which doesn't make a lot of sense
//var reg = XRegExp('[\uD83D\uDE01-\uD83D\uDE4F]', 'g'); // Range out of order in character class
//var reg = XRegExp('\\p{L}', 'g'); // doesn't match my symbols
console.log(XRegExp.replace(str, reg, function (match) {
   // Here I want something like %F0%9F%98%84, so that I can map this value
   // to anything I want and replace the match with it.
   return encodeURIComponent(match);
}));

I really don't want to brute-force the string looking for sequences of characters from my range. Could someone help me find a way to do this with regular expressions?

EDITED: I just came up with the idea of enumerating all the emoji symbols. It is better than brute force, but I am still looking for a better idea.

var reg = XRegExp('\uD83D\uDE01|\uD83D\uDE4F|...', 'g');


1 Reply


The \uXXXX notation takes exactly four hex digits, no fewer and no more, so it can only represent code points up to U+FFFF. Unicode characters above that are represented as pairs of surrogate code points.
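
To make the surrogate-pair representation concrete, here is a small sketch that derives the two surrogates for U+1F601 (the symbol from the question) by hand. The helper name `toSurrogatePair` is made up for illustration and is not part of XRegExp or any other library.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its
// high and low surrogates, following the UTF-16 encoding rules.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('code point is outside U+10000..U+10FFFF');
  }
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

var pair = toSurrogatePair(0x1F601);
var emoji = String.fromCharCode(pair[0], pair[1]);

console.log(pair[0].toString(16), pair[1].toString(16)); // d83d de01
console.log(emoji.length);                               // 2
console.log(encodeURIComponent(emoji));                  // %F0%9F%98%81
```

This is why a character class such as [\uD83D\uDE01-\uD83D\uDE4F] fails: the engine sees four separate 16-bit units, and the range in the middle runs from \uDE01 down to \uD83D.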

So some indirect approach is needed. Cf. "JavaScript strings outside of the BMP".

For example, you could look for code points in the range [\uD800-\uDBFF] (high surrogates), and when you find one, check that the next code point in the string is in the range [\uDC00-\uDFFF] (low surrogates); if it is not, there is a serious data error. Then interpret the two as a single Unicode character and replace them with whatever you wish to put there. This looks like a job for a simple loop through the string rather than a regular expression.
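
The loop described above might be sketched as follows. The names `replaceAstral` and `mapper` are illustrative assumptions, not an existing API, and throwing on an unpaired high surrogate is just one way to treat the "serious data error" case. A stray low surrogate is passed through unchanged in this sketch.

```javascript
// Sketch only: walk the string, detect surrogate pairs, and let a
// caller-supplied function decide what each character becomes.
function replaceAstral(str, mapper) {
  var out = '';
  for (var i = 0; i < str.length; i++) {
    var hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF) {
      var lo = str.charCodeAt(i + 1); // NaN if the string ends here
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Combine the two surrogates into the real code point.
        var codePoint = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        out += mapper(codePoint, str.slice(i, i + 2));
        i++; // skip the low surrogate
        continue;
      }
      throw new Error('Unpaired high surrogate at index ' + i);
    }
    out += str.charAt(i);
  }
  return out;
}

// Usage: encode only the range from the question, leave the rest alone.
var result = replaceAstral('\uD83D\uDE01wew\uD83D\uDE01', function (codePoint, chars) {
  if (codePoint >= 0x1F601 && codePoint <= 0x1F64F) {
    return encodeURIComponent(chars);
  }
  return chars;
});
console.log(result); // %F0%9F%98%81wew%F0%9F%98%81
```

If every target engine supports ES2015, a regular expression with the `u` flag, for example /[\u{1F601}-\u{1F64F}]/gu, should match the same range directly; the loop is the option that also works in older engines.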

