Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Comparing two columns of a csv and outputting string similarity ratio in another csv

I am very new to Python programming. I have a CSV file with two columns of string values, and I want to compute the similarity ratio between the two strings in each row. I then want to write those ratios to another file.

The csv may look like this:

Column 1|Column 2 
tomato|tomatoe 
potato|potatao 
apple|appel 

I want the output file to show, for each row, how similar the string in Column 1 is to the string in Column 2. I am using difflib to compute the ratio.

This is the code I have so far:

import csv
import difflib

f = open('test.csv')

csf_f = csv.reader(f)

row_a = []
row_b = []

for row in csf_f:
    row_a.append(row[0])
    row_b.append(row[1])

a = row_a
b = row_b

def similar(a, b):
    return difflib.SequenceMatcher(a, b).ratio()

match_ratio = similar(a, b)

match_list = []
for row in match_ratio:
    match_list.append(row)

with open("output.csv", "wb") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)

f.close()

I get the error:

Traceback (most recent call last):
  File "comparison.py", line 24, in <module>
    for row in match_ratio:
TypeError: 'float' object is not iterable

I feel like I am not reading the columns into lists correctly, or not passing them to difflib.SequenceMatcher correctly.

1 Reply

The for loop you're setting up expects an iterable such as a list where you have match_ratio, and judging by the error you're getting, match_ratio is a single float instead. It looks like you're missing the first argument for difflib.SequenceMatcher, which should probably be None. See the SequenceMatcher documentation here: https://docs.python.org/3/library/difflib.html

The first parameter of SequenceMatcher is isjunk, so without it your two lists are bound to isjunk and a, and b keeps its default of an empty string. ratio() then returns 0.0. Even if you correct your SequenceMatcher call, you'll still be comparing the two whole lists with each other and then trying to iterate over the single float that ratio() returns. I think you need to call SequenceMatcher inside the loop, once for each pair of values you're comparing.
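To make that concrete, here is a minimal sketch using the three sample rows from the question, hard-coded as two lists. The names col_a, col_b, whole and ratios are made up for illustration. It first repeats the original call to show that it yields one float, then builds one ratio per pair.

```python
import difflib

# Sample values from the question, hard-coded for illustration.
col_a = ["tomato", "potato", "apple"]
col_b = ["tomatoe", "potatao", "appel"]

# Original call: col_a is bound to isjunk, col_b to a, and b stays "".
whole = difflib.SequenceMatcher(col_a, col_b).ratio()
print(type(whole).__name__, whole)  # one float, so a for loop over it fails

# One SequenceMatcher per pair of strings gives one ratio per row.
ratios = [
    difflib.SequenceMatcher(None, a, b).ratio()
    for a, b in zip(col_a, col_b)
]
for a, b, r in zip(col_a, col_b, ratios):
    print(a, b, round(r, 3))
```

zip pairs the values up row by row, so ratios ends up with one entry per input row and can be iterated over safely.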

So you'd wind up with a call more like this in your function: difflib.SequenceMatcher(None, a, b). Or if you'd prefer, since a and b can be passed as keyword arguments, you could write difflib.SequenceMatcher(a=a, b=b).
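Putting the pieces together, the whole script might look roughly like the sketch below. It rests on a few assumptions that are not in the question: Python 3 (hence text mode with newline="" rather than "wb"), a header row in the input, and a comma delimiter by default. write_ratios and its parameters are hypothetical names, not part of csv or difflib.

```python
import csv
import difflib


def similar(a, b):
    # None as the first argument (isjunk) disables junk filtering.
    return difflib.SequenceMatcher(None, a, b).ratio()


def write_ratios(in_path, out_path, delimiter=",", has_header=True):
    """Copy the first two columns of in_path to out_path with a ratio column."""
    # The csv docs recommend newline="" when opening files in Python 3.
    with open(in_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter)
        if has_header:
            header = next(reader, None)
            if header is not None:
                writer.writerow(header[:2] + ["Ratio"])
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed rows
            a, b = row[0].strip(), row[1].strip()
            # writerow takes one row (a list); writerows takes a list of rows.
            writer.writerow([a, b, similar(a, b)])
```

Calling write_ratios("test.csv", "output.csv") would then write one line per input row. If the file really is pipe-separated, as in the sample shown in the question, pass delimiter="|".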

