Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
596 views
in Technique[技术] by (71.8m points)

php - Will [a-z] ever match accented characters in PREG/PCRE?

I'm already aware that w in PCRE (particularly PHP's implementation) can sometimes match some non-ASCII characters depending on the locale of the system, but what about [a-z]?

I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):

// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);

Is this true, or did someone simply get [a-z] confused with w?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Long story short: Maybe, depends on the system the app is deployed to, depends how PHP was compiled, welcome to the CF of localization and internationalization.

The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish based locale, ? would be caught by a-z). The semantic meaning of a-z is "all the letters between a and z, and ? is a separate letter in Spanish.

However, the way PHP blindly handles strings as collections of bytes rather than a collection of UTF code points means you have a situation where a-z MIGHT match an accented character. Given the variety of different systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.

I'd also conjecture that the existence of this regular expression is the result of a bug report being filed about German umlauts not being filtered.

Update in 2014: Per JimmiTh's answer below, it looks like (despite some "confusing-to-non-pcre-core-developers" documentation) that [a-z] will only match the characters abcdefghijklmnopqrstuvwxyz a proverbial 99% of the time. That said —?framework developers tend to get twitchy about vagueness in their code, especially when the code relies on systems (locale specific strings) that PHP doesn't handle as gracefully as you'd like, and servers the developers have no control over. While the anonymous Drupal developer's comments are incorrect — it wasn't a matter of "getting [a-z] confused with w", but instead a Drupal developer being unclear/unsure of how PCRE handled [a-z], and choosing the more specific form of abcdefghijklmnopqrstuvwxyz to ensure the specific behavior they wanted.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...