Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

language agnostic - What are some algorithms for comparing how similar two strings are?

I need to compare strings to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may differ. For example, consider the following two titles:

std::string first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP";

As opposed to:

std::string second = "Harper v. The Law Offices of Huey & Luey, LLP";

A human can quickly gauge that these are most likely one and the same. My current approach is to normalize the strings by lowercasing all letters and removing all punctuation and spaces, giving:

std::string firstNormalized = "henrycharpervthelawofficesofhueylueyllp";

And:

std::string secondNormalized = "harpervthelawofficesofhueylueyllp";
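The normalization step described above could be sketched roughly as follows. The function name normalizeTitle is hypothetical, and the sketch assumes ASCII titles in the default "C" locale, so any byte that is not an ASCII letter or digit is simply dropped.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the original post): lowercase ASCII letters
// and drop every character that is not a letter or a digit, which removes
// punctuation and spaces.
std::string normalizeTitle(const std::string& title) {
    std::string out;
    out.reserve(title.size());
    // Iterate as unsigned char: passing a negative char value to
    // std::isalnum or std::tolower is undefined behavior.
    for (unsigned char ch : title) {
        if (std::isalnum(ch)) {
            out.push_back(static_cast<char>(std::tolower(ch)));
        }
    }
    return out;
}
```

Applied to the first title above, this yields the same string as firstNormalized.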

In this case, one normalized string is a sub-sequence of the other, but you can imagine more complex variations where that does not hold, yet the strings still have significant sub-sequences in common. There could also be occasional human entry errors such as transposed letters and misspellings.

Perhaps some kind of character diff program could help? I've seen good line diff programs for comparing differences in code to be checked in. Is there something like that on a character basis, maybe in Boost? If you could count the number of consecutive characters in common and take the ratio to the characters unshared, perhaps that would be a good heuristic?

In the end, I need a Boolean decision as to whether to consider them the same or not. It doesn't have to be perfect, but it should ideally rarely be wrong.

What algorithm can I use that will give me some kind of quantification as to how similar the two strings are to each other which I can then convert into a yes/no answer by way of some heuristic?


1 Reply

What you're looking for are called string metric algorithms. There are a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance : The minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. The strings do not have to be the same length.
  • Hamming Distance : The number of positions at which two equal-length strings differ.
  • Smith–Waterman : A dynamic-programming algorithm for local sequence alignment, which scores the most similar sub-sequences of two strings.
  • Sørensen–Dice Coefficient : A similarity measure computed from the adjacent character pairs (bigrams) that the two strings share.
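As a rough illustration of the first metric, here is a minimal sketch of Levenshtein distance using the standard two-row dynamic-programming table, plus one possible way to turn it into the yes/no answer the question asks for. The names levenshteinSimilarity and probablySame, the length-based scaling, and the 0.8 threshold are assumptions made for this example and would need tuning against real titles.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance: minimum number of single-character insertions,
// deletions, or substitutions needed to turn a into b.
std::size_t levenshteinDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    // Turning the empty prefix of a into the first j characters of b
    // takes j insertions.
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // i deletions turn a's first i characters into ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Assumed scaling (not from the original answer): map the distance to
// [0, 1] by dividing by the longer length, so 1.0 means identical.
double levenshteinSimilarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;
    return 1.0 - static_cast<double>(levenshteinDistance(a, b)) /
                     static_cast<double>(longest);
}

// Hypothetical decision rule; the default threshold of 0.8 is arbitrary.
bool probablySame(const std::string& a, const std::string& b,
                  double threshold = 0.8) {
    return levenshteinSimilarity(a, b) >= threshold;
}
```

For the two normalized titles in the question, the distance is 6 (the leading "henryc" has to be deleted) over a longest length of 39, which gives a similarity of about 0.85 and passes the assumed threshold.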

Have a look at these as well as others on the wiki page on the topic.
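The Sørensen–Dice coefficient over adjacent character pairs can be sketched in a few lines as well. This version counts repeated bigrams as a multiset, and its treatment of strings shorter than two characters is an assumption made for the example; diceCoefficient is a name chosen here, not a library function.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Sørensen–Dice coefficient on character bigrams:
// 2 * (shared bigrams) / (bigrams in a + bigrams in b), in [0, 1].
double diceCoefficient(const std::string& a, const std::string& b) {
    // Assumed edge-case rule: strings too short to contain a bigram
    // are similar only if they are identical.
    if (a.size() < 2 || b.size() < 2) {
        return a == b ? 1.0 : 0.0;
    }

    // Count each bigram of a (multiset semantics).
    std::map<std::string, int> bigrams;
    for (std::size_t i = 0; i + 1 < a.size(); ++i) {
        ++bigrams[a.substr(i, 2)];
    }

    // Each bigram of b consumes at most one matching bigram from a.
    std::size_t shared = 0;
    for (std::size_t i = 0; i + 1 < b.size(); ++i) {
        auto it = bigrams.find(b.substr(i, 2));
        if (it != bigrams.end() && it->second > 0) {
            --it->second;
            ++shared;
        }
    }

    const std::size_t total = (a.size() - 1) + (b.size() - 1);
    return 2.0 * static_cast<double>(shared) / static_cast<double>(total);
}
```

Because it compares pairs rather than positions, this measure is less sensitive to a missing prefix or reordered words than a position-based metric such as Hamming distance.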

