Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


ruby - Scanning for Unicode Numbers in a string with \d

According to the Oniguruma documentation, the \d character type matches:

decimal digit char
Unicode: General_Category -- Decimal_Number

However, scanning for \d in a string containing all the Decimal_Number characters matches only the Latin digits 0-9:

#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

Am I misreading the documentation? Why doesn't \d match other Unicode numerals, and/or is there a way to make it do so?


1 Reply


Noted by Brian Candler on ruby-talk:

  • \w only matches ASCII letters, digits, and the underscore, while [[:alpha:]] matches the full set of Unicode letters.
  • \d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode decimal digits.
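
The two points above can be checked directly. The following is a minimal sketch, not part of the original answer: the sample string and variable names are made up for illustration, and the \p{Nd} property class is an additional option that the answer does not mention. It assumes a UTF-8 string on Ruby 1.9 or later.

```ruby
# encoding: utf-8
# Hypothetical sample: ASCII "1", ARABIC-INDIC DIGIT ONE (U+0661),
# and DEVANAGARI DIGIT ONE (U+0967).
sample = "1\u0661\u0967"

ascii_only   = sample.scan(/\d/)          # \d is ASCII-only in Ruby
posix_digits = sample.scan(/[[:digit:]]/) # POSIX bracket is Unicode-aware
property_nd  = sample.scan(/\p{Nd}/)      # General_Category Decimal_Number

p ascii_only.length    # 1
p posix_digits.length  # 3
p property_nd.length   # 3
```

Either [[:digit:]] or \p{Nd} can stand in for \d when every Decimal_Number character should match.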

The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:

\w  word character  
    Not Unicode: alphanumeric, "_" and multibyte char.  
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)

In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes (a Unicode mode and a Not Unicode mode) and that Ruby is operating in the Not Unicode mode.

This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.

p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]

It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)
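
The claim about the /u flag can be illustrated with a short sketch. The variable names are hypothetical, and the \p{L} property class is shown only as one possible alternative that the original answer does not discuss. It assumes a UTF-8 string on Ruby 1.9 or later.

```ruby
# encoding: utf-8
word = "ab\u00E7"  # "abç"

plain   = word.scan(/\w/)     # ASCII word characters only
u_flag  = word.scan(/\w/u)    # /u fixes the regexp encoding to UTF-8; \w stays ASCII-only
letters = word.scan(/\p{L}/)  # General_Category Letter, matches "ç" as well

p plain.length    # 2
p u_flag.length   # 2
p letters.length  # 3
```

The /u flag affects how the pattern itself is encoded, not what \w or \d match, which is consistent with the behavior described above.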

Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc

