Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

glyph - Find characters that are similar glyphically in Unicode?

Let's say I have the characters ú, ù, ü. All of them are glyphically similar to the English U.

Is there some list or algorithm to do this:

  • Given a ú or ù or ü, return the English U
  • Given an English U, return the list of all U-similar characters

I'm not sure whether the code point of a Unicode character is the same across all fonts. If it is, I suppose there could be an easy and efficient way to do this?

UPDATE

If you're using Ruby, the unicode-confusable gem may help in some cases.


1 Reply


It is very unclear what you are asking to do here.

  • There are characters whose canonical decompositions all start with the same base character: e, é, ê, ē, ě, …, and likewise s with its accented forms.

  • There are characters whose compatibility decompositions all include a particular character, such as e, s, or R (for example, the fullwidth and mathematical-alphabet forms of those letters).

  • There are characters that just happen to look alike in some fonts: B and Β and В, or ○ and 0 and O, or 1 and l and I and Ⅰ and | and ∣, as well as look-alikes of β, 3, γ, and F, ….

  • Characters that are the same case-insensitively, like s and S and ſ, or ss and Ss and SS and ß and ẞ, ….

  • Characters that all have the same numeric value, like all these for the value 1: 1, ①, ⑴, ⒈, Ⅰ, ⅰ, 一, ㈠, ….

  • Characters that all have the same primary collation strength, like all those that compare the same as d: D, d, and many more. Note that some of those are not accessible through any kind of decomposition, but only through the DUCET/UCA values; for example, some fairly common letters and some newer ones can be equated to d only through a primary UCA strength comparison; the same holds for certain variants of z, c, etc.

  • Characters that are the same in certain locales, like æ and ae, or ä and ae, or å and aa, or MacKinley and McKinley, …. Note that locale can make a really big difference, since in some locales c and ç are the same character while in others they are not; similarly for n and ñ, or a and á, ….
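
As a rough illustration of the first, second, fourth, and fifth notions above, here is a minimal Python sketch using only the standard `unicodedata` module. The helper names (`canonical_base`, `compat_form`, `same_caseless`, `numeric_value`) are made up for this example. It does not cover font look-alikes, collation strength, or locale rules, which need other data.

```python
import unicodedata

def canonical_base(ch):
    """Return the first code point of the canonical (NFD) decomposition."""
    return unicodedata.normalize("NFD", ch)[0]

def compat_form(ch):
    """Return the compatibility (NFKD) decomposition without combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def same_caseless(a, b):
    """Compare two strings using Unicode full case folding."""
    return a.casefold() == b.casefold()

def numeric_value(ch):
    """Return the Unicode numeric value of ch, or None if it has none."""
    return unicodedata.numeric(ch, None)

if __name__ == "__main__":
    print(canonical_base("\u00fa"))       # ú
    print(compat_form("\uff45"))          # fullwidth e
    print(same_caseless("\u00df", "SS"))  # ß versus SS
    print(numeric_value("\u2460"))        # ①
```

Each helper answers a different question, so which one is appropriate depends on what "similar" is supposed to mean for the task at hand.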

Some of these can be handled. Some cannot. All require different approaches depending on different needs.
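
For the reverse direction asked about in the question (given U, list the U-like characters), one of the cases that can be handled is the canonical-decomposition one. The sketch below, again in Python with a hypothetical helper name `chars_with_base`, scans a range of code points and keeps those whose NFD form starts with the given base. It assumes "similar" means only this accent-based relationship, so it will not find look-alikes from other scripts.

```python
import sys
import unicodedata

def chars_with_base(base, limit=sys.maxunicode + 1):
    """List characters below `limit` whose NFD form starts with `base`."""
    result = []
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points, which are not characters
        ch = chr(cp)
        if ch == base:
            continue  # the base itself is not a variant
        if unicodedata.normalize("NFD", ch)[0] == base:
            result.append(ch)
    return result

if __name__ == "__main__":
    # Restrict the scan to the Latin blocks to keep the output short.
    print("".join(chars_with_base("U", 0x250)))
```

Scanning the full code point range is a one-off cost, so in practice the result could be built once and cached as a dictionary from base character to variants.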

What is your real goal?

