If you want to delete all characters in the Other/Control Unicode category (Cc), you can do something like this:
System.out.println(
    "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
); // abcd
Note that this actually removes (among others) the Unicode character '\u008f' from the string, not the escaped form "%8F".
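To make that distinction concrete, here is a minimal sketch that wraps the same \p{Cc} pattern in a helper. The class name ControlCharStripper and the method stripControl are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Pattern;

public class ControlCharStripper {
    // Hypothetical helper: precompiled so the regex is not recompiled per call.
    private static final Pattern CONTROL = Pattern.compile("\\p{Cc}");

    /** Removes every character in the Unicode Cc (control) category. */
    public static String stripControl(String input) {
        return CONTROL.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The real control character U+008F is removed.
        System.out.println(stripControl("a\u0000b\u0007c\u008fd")); // abcd
        // The text "%8F" is three ordinary characters, so it is kept.
        System.out.println(stripControl("a%8Fd")); // a%8Fd
    }
}
```

Keep in mind that Cc also covers tab, carriage return and line feed, so this helper strips those as well.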
If the blacklist is not neatly captured by a single Unicode block or category, Java has powerful character class arithmetic, including intersection and subtraction, that you can use. Alternatively, you can use a negated whitelist approach: instead of explicitly specifying which characters are illegal, you specify which are legal, and everything else becomes illegal.
Examples
Here's a subtraction example:
System.out.println(
    "regular expressions: now you have two problems!!"
        .replaceAll("[a-z&&[^aeiou]]", "_")
);
// _e_u_a_ e___e__io__: _o_ _ou _a_e __o __o__e__!!
The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.
[a-z&&[^aeiou]] matches [a-z] minus [aeiou], i.e. all lowercase consonants.
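As a small sketch of the same && operator, the snippet below contrasts plain intersection with subtraction. The class CharClassArithmetic, the helper mask and the sample string are assumptions chosen for illustration.

```java
public class CharClassArithmetic {
    /** Hypothetical helper: replaces every match of the given character class with '_'. */
    public static String mask(String input, String charClass) {
        return input.replaceAll(charClass, "_");
    }

    public static void main(String[] args) {
        String s = "abcdefg mnopq";
        // Intersection: characters that are in a-z AND in [def].
        System.out.println(mask(s, "[a-z&&[def]]"));  // abc___g mnopq
        // Subtraction: characters that are in a-z AND NOT in m-p.
        System.out.println(mask(s, "[a-z&&[^m-p]]")); // _______ mnop_
    }
}
```

Subtraction is just intersection with a negated class, which is why both forms use the same && syntax.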
The next example shows the negated whitelist approach:
System.out.println(
    "regular expressions: now you have two problems!!"
        .replaceAll("[^a-z]", "_")
);
// regular_expressions__now_you_have_two_problems__
Only lowercase letters a-z are legal; everything else is illegal.
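The same negated whitelist can also be used to validate input rather than rewrite it. The sketch below assumes a slightly wider whitelist (lowercase letters, digits and underscore) purely as an example; the class WhitelistCheck and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

public class WhitelistCheck {
    // Assumed whitelist for illustration: a-z, 0-9 and '_'. Anything else is illegal.
    private static final Pattern ILLEGAL = Pattern.compile("[^a-z0-9_]");

    /** True if the string contains only whitelisted characters. */
    public static boolean isLegal(String s) {
        return !ILLEGAL.matcher(s).find();
    }

    /** Drops every character that is not on the whitelist. */
    public static String sanitize(String s) {
        return ILLEGAL.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(isLegal("user_42"));   // true
        System.out.println(isLegal("User 42!"));  // false
        System.out.println(sanitize("User 42!")); // ser42
    }
}
```

Whether to reject illegal input or silently sanitize it depends on the application; the pattern is the same either way.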