Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
396 views
in Technique[技术] by (71.8m points)

regex - Merge two dataframes on string columns with values containing wilcards as for like in SQL - Python

I want to merge 2 dataframes on string columns with values containing wildcards as we can do with like in SQL.

Example :

import pandas as pd

df1 = pd.DataFrame({'A': ["He eat an apple in his office.", "There are many apples on the tree."], 'B': [1, 2]})
df2 = pd.DataFrame({'A': ["apple*tree", "apple*pie"], 'C': [4, 9]})

df1
                                    A  B
0      He eat an apple in his office.  1
1  There are many apples on the tree.  2

df2
            A  C
0  apple*tree  4
1   apple*pie  9


pd.merge(df1, df2, on = ['A']) 

# What it gives me :

Empty DataFrame
Columns: [A, B, C]
Index: []


# What I want:
                                    A  B  C
0  There are many apples on the tree.  2  4

I want to join the two dataframes and "apple*tree" of df2 has to match the sentence "There are many apples on the tree." of df1.

Can you help me to do this please?

I have found the function fnmatch.fnmatch(string, pattern) but can I use it in this case with a merge?


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This can be done by using apply to search for df2's patterns in each row of df1. This will require runtime proportional to O(n*m), where n is the number of rows in df1, and m is the number of rows in df2. This is not very efficient, but that's fine for small dataframes.

Once we identify the matches between df1 and df2, we can merge the two dataframes. After that, we just need to clean up the dataframe and drop unneeded columns.

Code:

import pandas as pd
import fnmatch

df1 = pd.DataFrame({'A': ["He eat an apple in his office.", "There are many apples on the tree."], 'B': [1, 2]})
df2 = pd.DataFrame({'A': ["apple*tree", "apple*pie"], 'C': [4, 9]})

def wildcard_search(pattern):
    # Comment this line to require exact match
    pattern = "*" + pattern + "*"
    # Apply pattern to every A values within df1
    matching = df1['A'].apply(lambda x: fnmatch.fnmatch(x, pattern))
    # Get index of largest member
    first_match = matching.idxmax()
    # If we have all zeros, then first_match will refer to the first
    # zero. Check for this.
    if matching.loc[first_match] == 0:
        return None
    # print(first_match)
    return df1.loc[first_match, 'A']

# Using df2 patterns, search through df1. Record values found.
df2['merge_key'] = df2['A'].apply(wildcard_search)

# Merge two dataframes, on cols merge_key and A
res = df2.merge(
    df1,
    left_on='merge_key',
    right_on='A',
    suffixes=("_x", "")  # Don't add a suffix to df1's columns
)
# Reorder cols, drop unneeded
res = res[['A', 'B', 'C']]
print(res)

This answer is adapted from this post.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...