Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Weird backslash substitution in Ruby

I don't understand this Ruby code:

>> puts '\\ <- single backslash'
# \ <- single backslash

>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced

So far, all as expected. But if we search for 1 with /\\/, and replace with 2, encoded by '\\\\', why do we get this:

>> puts '\\ <- only 1 ... replace 1 with 2'.sub(/\\/, '\\\\')
# \ <- only 1 backslash, even though we replace 1 with 2

And then, when we encode 3 with '\\\\\\', we only get 2:

>> puts '\\ <- only 2 ... 1 with 3'.sub(/\\/, '\\\\\\')
# \\ <- 2 backslashes, even though we replace 1 with 3

Can anyone explain why a backslash gets swallowed in the replacement string? This happens on Ruby 1.8 and 1.9.

1 Reply

Quick Answer

If you want to sidestep all this confusion, use the block syntax: the value returned by the block is inserted literally, with no second round of backslash interpretation. Here is an example that replaces each backslash with 2 backslashes:

"some\\path".gsub('\\') { '\\\\' }

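As a rough comparison of the two forms, the sketch below doubles the backslash in a sample string both ways. The variable names are only illustrative; the point is that the string form needs four backslash characters in the replacement (eight in the single-quoted literal) to emit two.

```ruby
path = "some\\path"   # the text some\path, containing one backslash

# Block form: the block's return value is inserted as-is.
doubled_with_block = path.gsub('\\') { '\\\\' }

# String form: the replacement is scanned for backslash sequences first,
# so it must hold four backslash characters to emit two.
doubled_with_string = path.gsub('\\', '\\\\\\\\')

puts doubled_with_block
puts doubled_with_string
```

Both calls should print the same text, with two backslashes between "some" and "path".
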
Gruesome Details

The problem is that when sub (or gsub) is called without a block, Ruby interprets special character sequences in the replacement parameter. Unfortunately, sub uses the backslash as the escape character for these:

\& (the entire match)
\+ (the last group)
\` (pre-match string)
\' (post-match string)
\0 (same as \&)
\1 (first captured group)
\2 (second captured group)
\\ (a backslash)
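A few of the sequences in the list above can be tried directly. This is a small sketch with made-up sample strings; single-quoted literals are used where possible so that each backslash in the source is one backslash in the replacement.

```ruby
name  = "John Smith"
words = /(\w+) (\w+)/

swapped   = name.sub(words, '\2, \1')   # numbered groups
whole_0   = name.sub(words, '<\0>')     # whole match via \0
whole_amp = name.sub(words, '<\&>')     # whole match via \&
pre       = "a-b-c".sub(/b/, '[\`]')    # text before the match
post      = "a-b-c".sub(/b/, "[\\']")   # text after the match
escaped   = "a-b-c".sub(/b/, '\\\\')    # \\ emits one literal backslash

puts swapped, whole_0, whole_amp, pre, post, escaped
```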

Like any escaping, this creates an obvious problem. If you want to include the literal value of one of the above sequences (e.g. \1) in the output string, you have to escape it. So, to get Hello \1, you need the replacement string to be Hello \\1. And to represent this as a string literal in Ruby, you have to escape those backslashes again like this: "Hello \\\\1"

So, there are two different escaping passes. The first one takes the string literal and creates the internal string value. The second takes that internal string value and replaces the sequences above with the matching data.
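The two passes can be observed separately. In this sketch (the pattern and the input "x" are just placeholders), the first pass is visible in the length of the string object, and the second in what sub finally emits.

```ruby
# Pass 1: the parser turns the literal into the 9-character string Hello \\1
literal = "Hello \\\\1"
puts literal.length

# Pass 2: sub turns \\ into a single backslash, so the 1 stays literal text.
result = "x".sub(/(x)/, literal)
puts result

# With only one level of escaping, \1 reaches sub and is replaced by group 1.
contrast = "x".sub(/(x)/, "Hello \\1")
puts contrast
```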

If a backslash is not followed by a character that matches one of the above sequences, then the backslash (and the character that follows) will pass through unaltered. This also affects a backslash at the end of the string: it will pass through unaltered. It's easiest to see this logic in the Rubinius code; just look for the to_sub_replacement method in the String class.

Here are some examples of how String#sub parses the replacement string:

  • 1 backslash \ (which has a string literal of "\\")

    Passes through unaltered because the backslash is at the end of the string and has no characters after it.

    Result: \

  • 2 backslashes \\ (which have a string literal of "\\\\")

    The pair of backslashes matches the escaped backslash sequence (see \\ above) and gets converted into a single backslash.

    Result: \

  • 3 backslashes \\\ (which have a string literal of "\\\\\\")

    The first two backslashes match the \\ sequence and get converted to a single backslash. Then the final backslash is at the end of the string so it passes through unaltered.

    Result: \\

  • 4 backslashes \\\\ (which have a string literal of "\\\\\\\\")

    Two pairs of backslashes each match the \\ sequence and get converted to a single backslash.

    Result: \\

  • 2 backslashes with a character in the middle \a\ (which have a string literal of "\\a\\")

    The a does not match any of the escape sequences, so the \a pair passes through unaltered. The trailing backslash is also allowed through.

    Result: \a\

    Note: The same result could be obtained from \\a\\ (with the string literal "\\\\a\\\\")
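The cases in this list can be checked mechanically. The sketch below is only an illustration: the hash maps each replacement string (as sub sees it) to the result the list describes, and the input 'x' is an arbitrary placeholder.

```ruby
# Keys: the replacement string passed to sub. Values: the expected output.
# Single-quoted literals are used, so '\\' is one backslash character.
examples = {
  '\\'        => '\\',      # 1 backslash   -> 1 (trailing, passes through)
  '\\\\'      => '\\',      # 2 backslashes -> 1
  '\\\\\\'    => '\\\\',    # 3 backslashes -> 2
  '\\\\\\\\'  => '\\\\',    # 4 backslashes -> 2
  '\\a\\'     => '\\a\\',   # \a\   passes through unchanged
  '\\\\a\\\\' => '\\a\\'    # \\a\\ collapses to \a\
}

results = examples.map do |replacement, expected|
  actual = 'x'.sub('x', replacement)
  # inspect shows each value as a double-quoted literal, i.e. re-escaped
  puts "#{replacement.inspect} -> #{actual.inspect}"
  actual == expected
end

puts results.all? ? "all cases match" : "mismatch found"
```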

In hindsight, this could have been less confusing if String#sub had used a different escape character. Then there wouldn't be the need to double escape all the backslashes.

