Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
293 views
in Technique[技术] by (71.8m points)

python - how to sort dataframe2 according to dataframe1 with fuzzywuzzy

I know this is old question in fact i have seen many links related to my question:

Using fuzzywuzzy to create a column of matched results in the data frame

How to compare a value in one dataframe to a column in another using fuzzywuzzy ratio

What's the best way to use fuzzywuzzy to compare each value of a column with all the values of a separate dataframe's column?

But i didnt get any proper solution for this

below is my code:

g = [{'column1': 'ryzen 5 5600'},{'column1':'ram 8 gb ddr4 3.2ghz'}, {'column2':'SSD 
                                     220gb'}, {'column3':'windows 10 prof'}, 
                            {'column2':'ryzen 5 3600'}, {'column1':'ram 16 gb ddr4'}]

df1=pd.read_excel('product1.xlsx', header=None, index_col=False)
s = []
for l in df1.values:
    l = ', '.join(l)
    s.append(l)

s = ', '.join(s)    

MIN_MATCH_SCORE = 30
guessed_word = [d for d in g if fuzz.token_set_ratio(s, list(d.values())[0]) >= 30]

product1 contains:

0   GB ddr4
1   HDD 256GB
2   SSD
3   ryzen 5
4   Win 10 Pro

guessed_word contains:

#gives good output

[{'column1': 'ryzen 5 5600'},
 {'column1': 'ram 8 gb ddr4 3.2ghz'}, 
 {'column2': 'SSD 220gb'}, 
 {'column3': 'windows 10 prof'}, 
 {'column2': 'ryzen 5 3600'}, 
 {'column1': 'ram 16 gb ddr4'}]

After appending to dataframe:

df3 = pd.Dataframe(guessed_word)

df3 contains:

column1                  column2          column3
     
ryzen 5 5600             SSD 220gb        windows 10 prof
ram 8 gb ddr4 3.2ghz     ryzen 5 3600
ram 16 gb ddr4

But i want following output:

#product1              column1                                column2          column3
0   GB ddr4            ram 8 gb ddr4 3.2ghz, ram 16 gb ddr4   NAN               NAN
1   HDD 256GB          NAN                                    NAN               NAN 
2   SSD                NAN                                    SSD 220gb               NAN
3   ryzen 5            ryzen 5 5600                           ryzen 5 3600      NAN
4   Win 10 Pro         NAN                                    NAN        windows 10 prof

is it possible to sort with df.sort_values or anything ? i tried and none of that are working.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The code is kinda lengthy, but it does exactly what you are expecting.

import re
import pandas as pd
#from fuzzywuzzy import fuzz, process

class CustomMatcher:
    def add_space_before_numbers(self, text):
        return (re.sub(r'([0-9.]+)', r' 1', text)).replace('  ', ' ')

    def add_space_before_numbers(self, text):
        return re.sub(r'([0-9.]+)', r' 1', text)

    def add_space_after_numbers(self, text):
        return re.sub(r'([0-9.]+)([^0-9.])', r'1 2', text)

    def pad_spaces(self, text):
        result = self.add_space_before_numbers(text)
        result = self.add_space_after_numbers(result)
        return result.replace('  ', ' ')
        
    def partial_word_score(self, word1, word2):
        score = 0
        len1 = len(word1)
        len2 = len(word2)
        if len2 > len1:
            temp = word2
            word2 = [*word1]
            word1 = [*temp]
        else:
            word2 = [*word2]
            word1 = [*word1]

        for i, char in enumerate(word2):
            if word1[i] == char:
                score = score + 1
        if min(len1, len2) != 0:
            return (score*100) / min(len1, len2)
        else:
            return 0

    def match(self, comparand, target):
        len_c = len(comparand)
        len_t = len(target)
        comparand_words = self.pad_spaces(comparand.lower()).split(' ')
        target_words = self.pad_spaces(target.lower()).split(' ')

        complete_score = 0
        for t_word in target_words:
            for c_word in comparand_words:
                len1 = len(t_word)
                len2 = len(c_word)
                word_score = self.partial_word_score(t_word, c_word)
                    * (min(len1, len2) / min(len_c, len_t))
                complete_score = complete_score + word_score
        return complete_score

search_array = [
    {'column1': 'ryzen 5 5600'},
    {'column1': 'ram 8 gb ddr4 3.2ghz'},
    {'column2': 'SSD 220gb'},
    {'column3': 'windows 10 prof'},
    {'column2': 'ryzen 5 3600'},
    {'column1': 'ram 16 gb ddr4'}
]

search_dict = {}
for entry in search_array:
    key = [*entry][0]
    value = entry[key]
    if key in [*search_dict]:
        search_dict[key].append(value)
    else:
        search_dict[key] = [value]

filename = 'product1.xlsx'
products_sheet = pd.read_excel(filename, header=None, index_col=False)
#word_set = ', '.join([x[0] for x in products_sheet.values.tolist()])
#MIN_MATCH_SCORE = 30
products_list = [x[0] for x in products_sheet.values.tolist()]
# Column #1
result_data = {}
result_data[filename.replace('.xlsx','')] = products_list

# Initialize columns #2-#n and populate it with placeholder values
columns = [*search_dict]
for column in columns:
    result_data[column]=list(range(products_list.__len__()))

for row_no, row in enumerate(products_list):
    for column in columns:
        matched_products_list=[]
        for product in search_dict[column]:
            print(f'Comparing {row} to {product} is:', end='')
            cm = CustomMatcher()
            matching_score = cm.match(row, product)
            if matching_score>50:
            #if fuzz.token_set_ratio(row, product)>25:
                print(matching_score, ' accepted')
                matched_products_list.append(product)
            else:
                print (matching_score, ' rejected')
        if (matched_products_list != []):
            result_data[column][row_no] = matched_products_list
        else:
            result_data[column][row_no] = 'NAN'

result_df = pd.DataFrame(data=result_data)
print(result_df)

Notes:

  • I have created a CustomMatcher instead of using this fuzzywuzzy thing that is acting too crazy to have a meaningful threshold level to filter on. CustomMatcher is word-based in calculating the score, but letter-based on the comparison. It isolates numbers as words to be matched to after padding them with spaces. Those 50+ lines are easily accessible through the function CustomMatcher.match(word1, word2) I used matching_score>50 as a reasonable sensitivity threshold for your application to match.

  • You do not need to define a join between the entries in a single cell, instead, I used lists that ease access to every individual unit.

  • The output is packed as a pandas data frame.

Thanks,


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...