Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

nlp - How to process similar notations with Python?

I have a list of keywords and their corresponding search volumes as a CSV file. The keywords are in German. Some keywords are unique; others differ only slightly in notation, as in this example:

+--------------------------------------+-----+
| verkehrsrechtsschutz rückwirkend     | 50  |
+--------------------------------------+-----+
| verkehrs-rechtsschutz rückwirkend    | 50  |
+--------------------------------------+-----+
| familien rechtsschutzversicherung    | 100 |
+--------------------------------------+-----+
| familienrechtsschutzversicherung     | 100 |
+--------------------------------------+-----+
| privat rechtsschutz ohne wartezeit   | 20  |
+--------------------------------------+-----+
| privater rechtsschutz ohne wartezeit | 20  |
+--------------------------------------+-----+
| rechtsschutzversicherung strafrecht  | 80  |
+--------------------------------------+-----+
| strafrechtsschutz                    | 80  |
+--------------------------------------+-----+
| rechtsschutzversicherung gewerbe     | 200 |
+--------------------------------------+-----+
| rechtsschutzversicherung gewerblich  | 200 |
+--------------------------------------+-----+
| fahrer rechtsschutz                  | 160 |
+--------------------------------------+-----+
| fahrerrechtsschutz                   | 160 |
+--------------------------------------+-----+
| fahrer-rechtsschutz                  | 160 |
+--------------------------------------+-----+

Keywords with similar notation often have the same search volume, but not always.

I'm looking for a way to move all keywords with similar notation into another file.

I guess it could be done with Python, but I don't know which module, package or library has the language-processing capability to recognize similar notations and to decide how keywords relate to each other.

Please point me into the right direction.

Update: Solutions that calculate a similarity ratio will produce a very high number of false positives and false negatives because of the structure of the German language. I am thinking rather of a tool that "knows" German linguistics and works with the language itself, not with string differences. Maybe something like https://pypi.org/project/textblob-de/, https://spacy.io/models/de or something from https://github.com/adbar/German-NLP

I have already tried some tools that calculate string differences (some VBA and Google Apps Script scripts); they fail miserably.



1 Reply


Using the SequenceMatcher class of the difflib module, you can measure how similar two strings are. Its ratio() method returns a value between 0 (nothing in common) and 1 (identical):

from difflib import SequenceMatcher

# Two notations that differ only in the last word
s1 = 'rechtsschutzversicherung gewerbe'
s2 = 'rechtsschutzversicherung gewerblich'
print(SequenceMatcher(a=s1, b=s2).ratio())  # Prints 0.9253731343283582

# A different keyword that only shares the stem 'rechtsschutz'
s3 = 'fahrer rechtsschutz'
print(SequenceMatcher(a=s1, b=s3).ratio())  # Prints 0.47058823529411764
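
To apply this to the whole list, one possible approach is sketched below. It is only a rough illustration, not a linguistic solution: the helper names (normalize, group_similar, split_keywords), the 0.9 threshold and the idea of stripping spaces and hyphens before comparing are all my own assumptions, and the threshold would need tuning on real data. The rows are the sample from the question.

```python
from difflib import SequenceMatcher


def normalize(keyword):
    """Lowercase and drop spaces and hyphens so compound spellings collapse."""
    return keyword.lower().replace("-", "").replace(" ", "")


def group_similar(keywords, threshold=0.9):
    """Greedily group keywords whose normalized forms reach the threshold."""
    groups = []  # each group is a list of original keywords
    for kw in keywords:
        for group in groups:
            # Compare against the first member of the group only.
            first = normalize(group[0])
            if SequenceMatcher(a=first, b=normalize(kw)).ratio() >= threshold:
                group.append(kw)
                break
        else:
            # No existing group was close enough: start a new one.
            groups.append([kw])
    return groups


def split_keywords(rows, threshold=0.9):
    """Split (keyword, volume) rows into (unique_rows, similar_rows)."""
    groups = group_similar([kw for kw, _ in rows], threshold)
    grouped = {kw for g in groups if len(g) > 1 for kw in g}
    unique_rows = [row for row in rows if row[0] not in grouped]
    similar_rows = [row for row in rows if row[0] in grouped]
    return unique_rows, similar_rows


# Sample data from the question (keyword, search volume)
rows = [
    ("verkehrsrechtsschutz rückwirkend", 50),
    ("verkehrs-rechtsschutz rückwirkend", 50),
    ("familien rechtsschutzversicherung", 100),
    ("familienrechtsschutzversicherung", 100),
    ("privat rechtsschutz ohne wartezeit", 20),
    ("privater rechtsschutz ohne wartezeit", 20),
    ("rechtsschutzversicherung strafrecht", 80),
    ("strafrechtsschutz", 80),
    ("rechtsschutzversicherung gewerbe", 200),
    ("rechtsschutzversicherung gewerblich", 200),
    ("fahrer rechtsschutz", 160),
    ("fahrerrechtsschutz", 160),
    ("fahrer-rechtsschutz", 160),
]

unique, similar = split_keywords(rows)
print("unique:", unique)
print("similar:", similar)
```

The two lists could then be written to separate files with csv.writer. Note that on this sample the pair 'rechtsschutzversicherung strafrecht' / 'strafrechtsschutz' stays in the unique list, because the two strings are too different character by character. That is exactly the kind of false negative mentioned in the question's update, so a purely string-based approach like this will not catch every related keyword.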
