Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - lower() vs. casefold() in string matching and converting to lowercase

How do I do a case-insensitive string comparison?

From what I understood from Google and the link above, both functions, lower() and casefold(), convert the string to lowercase, but casefold() will also convert caseless letters such as the German ß to ss.

All of that is about German or Greek letters, but my question is more general:

  • are there any other differences?
  • which one is better to convert to lowercase?
  • which one is better to check the matching strings?

Part 2:

firstString = "der Fluß"
secondString = "der Fluss"

# ß is equivalent to ss
if firstString.casefold() == secondString.casefold():
    print('The strings are equal.')
else:
    print('The strings are not equal.')

In the example above should I use:

lower() # the result is "not equal", which makes sense to me

Or:

casefold() # here ß is treated as ss and the result is "the strings
        # are equal". (Since I am a beginner, that still does not
        # make sense to me. I see different strings.)


1 Reply

TL;DR

  • Purely ASCII Text -> lower()
  • Unicode text/user input -> casefold()

Casefolding is a more aggressive version of lower() that is designed to make strings comparable even when they contain Unicode characters that lower() leaves untouched, such as ß. It is another form of text normalization: strings that initially appear very different can fold to the same value, and the mapping takes characters from many different languages into account.
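
To make the difference concrete, here is a minimal sketch using the two strings from the question. The variable names are only for illustration.

```python
first = "der Fluß"
second = "der Fluss"

# lower() leaves ß alone because ß is already a lowercase letter.
lowered_equal = first.lower() == second.lower()

# casefold() additionally maps ß to "ss", so the two spellings compare equal.
folded_equal = first.casefold() == second.casefold()

print(first.lower())                # der fluß
print(first.casefold())             # der fluss
print(lowered_equal, folded_equal)  # False True
```

So the strings are different as written, but they are the same once case distinctions (in the Unicode sense) are folded away, which is what a caseless comparison asks for.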

I suggest you take a closer look into what case folding actually is, so here's a good start: W3 Case Folding Wiki

To answer your other two questions: if you are working strictly with English text, lower() and casefold() should yield exactly the same results. However, if you are trying to normalize text from languages that use more than the plain 26-letter alphabet available in ASCII, I would use casefold() to compare your strings, as it will yield more consistent results.
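
The claim about ASCII text can be checked directly. The snippet below is a small sketch, with a hand-picked sample list that is an assumption for illustration only.

```python
# Every ASCII character (code points 0-127) gives the same result either way.
ascii_chars = [chr(cp) for cp in range(128)]
same_for_ascii = all(c.lower() == c.casefold() for c in ascii_chars)
print(same_for_ascii)  # True

# Outside ASCII the two methods can disagree; ß is one such character.
samples = ["ß", "É", "k"]
differs = [ch for ch in samples if ch.lower() != ch.casefold()]
print(differs)  # ['ß']
```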

Another source: Elastic.co Case Folding

Edit: I just recently found another very good related answer to a slightly different question here on SO (doing a case-insensitive string comparison)
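
Caseless comparison is often combined with Unicode normalization, because casefold() alone does not unify composed and decomposed forms of the same character. The following is a rough sketch of that idea, not a quotation from the linked answer; caseless_key is a hypothetical helper name, and fully canonical caseless matching needs a further normalization pass after folding.

```python
import unicodedata

def caseless_key(text: str) -> str:
    """Hypothetical helper: decompose, then casefold, for comparison only."""
    return unicodedata.normalize("NFD", text).casefold()

composed = "\u00ea"     # ê as a single code point
decomposed = "e\u0302"  # e followed by a combining circumflex

# casefold() on its own still sees two different code point sequences.
print(composed.casefold() == decomposed.casefold())        # False

# After NFD normalization both forms produce the same key.
print(caseless_key(composed) == caseless_key(decomposed))  # True
```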


Another Edit: @Voo's comments have been bouncing around in the back of my mind for a few months, so here are some further thoughts:

As Voo mentioned, there aren't any languages that never use text outside the standard ASCII range. That's pretty much why Unicode exists. With that in mind, it makes more sense to me to use casefold() on anything user-entered that can contain non-ASCII values. That may leave out some text, such as values from a database that deals strictly with ASCII, but in general most user input should probably go through casefold(), because it has the logic to properly remove case distinctions from all of the characters.

On the other hand, values that are known to be generated in the ASCII character space, like hex UUIDs, can be normalized with lower() because it is the simpler transformation: on such input it only has to map the 26 letters A-Z. Note that Python's str.lower() is itself Unicode-aware, so any saving in time or memory compared with casefold() is likely to be small; the main point is that the extra folding rules are not needed. Additionally, if you know that your data comes from a CHAR or VARCHAR database field (SQL Server) that only ever holds ASCII, you can similarly just use lower(). Be aware that such fields can still hold non-ASCII letters from their code page, such as ß, so this only applies when the data really is ASCII.
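
As a small illustration of the ASCII-only case, here is a sketch that normalizes a hex UUID string. The UUID value is an arbitrary example chosen for this snippet.

```python
import uuid

# Assumed example value; hex UUIDs only contain 0-9, a-f and hyphens.
raw = "3F2504E0-4F89-11D3-9A0C-0305E82C3301"

normalized = raw.lower()
print(normalized)  # 3f2504e0-4f89-11d3-9a0c-0305e82c3301

# For ASCII-only input, casefold() gives the same string as lower().
print(normalized == raw.casefold())  # True

# uuid.UUID accepts either case and parses both to the same value.
print(uuid.UUID(raw) == uuid.UUID(normalized))  # True
```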

So really, this question comes down to knowing the source of your data, and when in doubt about your user-entered information, just casefold().
