Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Java String.split() sometimes giving blank strings

I'm making a text based dice roller. It takes in strings like "2d10+5" and returns a string as a result of the roll(s). My problem is showing up in the tokenizer that splits the string into useful parts for me to parse into information.

String[] tokens = message.split("(?=[dk\\+\\-])");

This is yielding strange, unexpected results. I don't know exactly what is causing them. It could be the regex, my misunderstanding, or Java just being Java. Here's what's happening:

  • 3d6+4 yields the string array [3, d6, +4]. This is correct.
  • d% yields the string array [d%]. This is correct.
  • d20 yields the string array [d20]. This is correct.
  • d%+3 yields the string array [, d%, +3]. This is incorrect.
  • d20+2 yields the string array [, d20, +2]. This is incorrect.

In the fourth and fifth examples, something strange is causing an extra empty string to appear at the front of the array. It's not the lack of a number at the front of the string, as other examples disprove that. It's not the presence of the percentage sign, nor the plus sign.

For now I'm just continuing through the for loop on blank strings, but that feels sorta like a band-aid solution. Does anyone have any idea what causes the blank string at the front of the array? How can I fix it?


1 Reply


Digging through the source code, I found the exact cause of this behaviour.

The String.split() method internally uses Pattern.split(). Before returning the resulting array, Pattern.split() checks the end index of the last match (the variable index in the code below). If that index is 0, either the pattern did not match at all or its only match was an empty string at the very beginning of the input. In both cases the method returns a single-element array containing the original string.

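Here's the relevant part of the source code of Pattern.split() (this is the implementation from before Java 8), with two comments of mine added: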

public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<String>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);

                // Consider this assignment. For a single empty string match
                // m.end() will be 0, and hence index will also be 0
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // ... (the rest of the method is not relevant here)

If the last condition in the above code, index == 0, is true, a single-element array containing the input string is returned.

Now, consider the cases when the index can be 0.

  1. When there is no match at all (as the comment above that condition already says).
  2. When the only match is at the beginning of the string and the matched text has length 0. In that case the assignment in the if block (inside the while loop),

    index = m.end();

    sets index to 0. Only an empty match (length = 0) can end at position 0, which is exactly the case here. There also must not be any further matches, otherwise index would be updated to a different value.
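To see where the pattern actually matches, here is a small sketch that lists the start and end of every match using Matcher directly. The class name MatchPositions and the helper positions() are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchPositions {
    // Hypothetical helper: returns "start-end" for every match of the
    // tokenizer pattern in the given input.
    static List<String> positions(String input) {
        Matcher m = Pattern.compile("(?=[dk+-])").matcher(input);
        List<String> spans = new ArrayList<>();
        while (m.find()) {
            // A lookahead consumes nothing, so start and end are always equal.
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(positions("d%"));     // [0-0]
        System.out.println(positions("d20+2"));  // [0-0, 3-3]
        System.out.println(positions("3d6+4"));  // [1-1, 3-3]
    }
}
```

For d% the only match ends at 0, so index stays 0. For d20+2 the last match ends at 3, so index ends up non-zero.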

So, considering your cases:

  • For d%, there is just a single match for the pattern, at the empty position before the d. Hence index is 0. Since there aren't any further matches, index is never updated, the if condition is true, and the single-element array with the original string is returned.

  • For d20+2 there are two matches, one before d and one before +. So index is updated (to 3), the if condition is false, and the contents of the ArrayList in the above code are returned. That list contains the empty string produced by splitting on a delimiter at the very start of the string, as already explained in @Stema's answer.
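The two cases can be reproduced with a simplified re-creation of the loop quoted above. This is only a sketch, not the JDK code: the class LegacySplit and its split() method are hypothetical names, and it handles only the limit == 0 case. I wrote it as a stand-alone method because, as far as I know, String.split() on Java 8 and later no longer produces the leading empty string for a zero-width match at the start, so calling the real method there would not show the behaviour discussed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacySplit {
    // Simplified imitation of the quoted pre-Java 8 loop (limit == 0 only).
    static String[] split(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> parts = new ArrayList<>();
        int index = 0;
        while (m.find()) {
            parts.add(input.substring(index, m.start()));
            index = m.end();
        }
        // True for "no match" and for "only a zero-width match at position 0".
        if (index == 0) {
            return new String[] {input};
        }
        parts.add(input.substring(index));
        // With limit == 0, trailing empty strings are dropped.
        int size = parts.size();
        while (size > 0 && parts.get(size - 1).isEmpty()) {
            size--;
        }
        return parts.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        String regex = "(?=[dk+-])";
        System.out.println(Arrays.toString(split(regex, "d%")));     // [d%]
        System.out.println(Arrays.toString(split(regex, "d20+2")));  // [, d20, +2]
        System.out.println(Arrays.toString(split(regex, "3d6+4")));  // [3, d6, +4]
    }
}
```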

So, to get the behaviour you want (that is, to split on the delimiter only when it is not at the beginning of the string), you can add a negative look-behind to your regex pattern:

"(?<!^)(?=[dk+-])"  // Inside a character class you don't need to escape + or a hyphen at the end

This will split on an empty string that is followed by a character from your character class, but not preceded by the beginning of the string.
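A quick check of the look-behind pattern against the inputs from the question might look like this. The class name DiceTokenizer and the method tokenize() are placeholders I picked for the example, not something from the question's code.

```java
import java.util.Arrays;

class DiceTokenizer {
    // Splits before d, k, + or -, except at the very start of the string.
    static String[] tokenize(String message) {
        return message.split("(?<!^)(?=[dk+-])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("3d6+4")));  // [3, d6, +4]
        System.out.println(Arrays.toString(tokenize("d%")));     // [d%]
        System.out.println(Arrays.toString(tokenize("d%+3")));   // [d%, +3]
        System.out.println(Arrays.toString(tokenize("d20+2")));  // [d20, +2]
    }
}
```

Because the pattern can never match at position 0, the result does not depend on how split() treats a zero-width match at the start of the input.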


For comparison, consider splitting the string "ad%" on the pattern "a(?=[dk+-])". This also gives you an array whose first element is an empty string. The only difference is that the match at the start is now the character a instead of an empty string:

"ad%".split("a(?=[dk+-])");  // yields [, d%]

Why? Because the length of the matched string is 1. After the first match, index = m.end() is 1 rather than 0, so the single-element array is not returned.

