I'm joining two tables of data from two different systems. A simple pandas merge between the two DataFrames won't honor my more complex matching rules (unless I'm using it wrong or misunderstanding what merge actually does, which is very possible). I've cobbled together a toy solution that unpacks the two DataFrames with itertuples, validates matches based on values, and repacks the result into one DataFrame:
df1:            df2:
   A   X           B   Y
0  1  10        0  2  10
1  5  15        1  4  15
                2  6  15
import pandas as pd

df1 = pd.DataFrame(data1, columns=['A', 'X'])
df2 = pd.DataFrame(data2, columns=['B', 'Y'])
df3 = pd.DataFrame(index=['A', 'X', 'B', 'Y'])
i = -1
for rowA in df1.itertuples(index=False):
    i += 1
    for rowB in df2.itertuples(index=False):
        A, X = rowA
        B, Y = rowB
        if (B > A) & (X == Y):
            df3[i] = list(rowA + rowB)
print(df3.transpose())
A X B Y
0 1 10 2 10
1 5 15 6 15
My naive approach is inefficient: the nested for loop iterates over all of data2/df2 for every row of data1/df1. Once a row of df2 produces a good match, it should be removed from further consideration.
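One way to cut the wasted scanning (a sketch under my own assumptions, not necessarily the fastest approach) is to keep df2's rows in a mutable pool, delete each row once it matches, break out of the inner loop, and collect any df1 rows that never match:

```python
import pandas as pd

# Hypothetical data matching the toy example above.
df1 = pd.DataFrame([(1, 10), (5, 15)], columns=['A', 'X'])
df2 = pd.DataFrame([(2, 10), (4, 15), (6, 15)], columns=['B', 'Y'])

matched, unmatched = [], []
candidates = list(df2.itertuples(index=False))  # mutable pool to eliminate from

for rowA in df1.itertuples(index=False):
    for j, rowB in enumerate(candidates):
        if rowB.B > rowA.A and rowB.Y == rowA.X:
            matched.append(rowA + rowB)
            del candidates[j]        # eliminate the matched row
            break                    # stop scanning once matched
    else:
        unmatched.append(rowA)       # collector for misses

df3 = pd.DataFrame(matched, columns=['A', 'X', 'B', 'Y'])
print(df3)
```

This reproduces the output above while shrinking the candidate pool as matches are found, and the `for`/`else` gives the missed-match collector for free.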
// UPDATE (showing the origin of my question)
An example of the kind of data I'm working with merges two independent systems which do not share any keys or other serialized IDs. Since I can't make an exact match, I must rely on logical/arithmetic comparisons and the process of elimination.
In the following example a simple pandas.merge fails on line 3, because Time1 < Time2.
Time1, Total1 ... Time2, Total2, error
1, 2017-02-19 08:03:00, 15.00 ... 2017-02-19 08:02:00, 15.00, 0
2, 2017-02-19 08:28:00, 33.00 ... 2017-02-19 08:27:00, 33.00, 0
3, 2017-02-19 08:40:00, 20.00 ... 2017-02-19 10:06:00, 20.00, 1
4, 2017-02-19 10:08:00, 20.00 ... 2017-02-19 10:16:00, 20.00, 1
[...]
What should happen is something like this:
Time1, Total1 ... Time2, Total2, error
1, 2017-02-19 08:03:00, 15.00 ... 2017-02-19 08:02:00, 15.00, 0
2, 2017-02-19 08:28:00, 33.00 ... 2017-02-19 08:27:00, 33.00, 0
3, 2017-02-19 08:40:00, 20.00 ... NaN, NaN, NaN
4, 2017-02-19 10:08:00, 20.00 ... 2017-02-19 10:06:00, 20.00, 0
[...]
// UPDATE2
I've worked on several permutations of merge_asof() and join() recommended in answers. Each method was also sorted as directed by the docs. Assuming I've implemented each correctly, the following percentages are True matches of the rule ((Time1 >= Time2) & (Total1 == Total2)) out of 53 records in my test set, using each of three methods:
| method                  | 'date' | 'total' | both  |
|-------------------------|--------|---------|-------|
| merge_asof sort (time)  | .7924  | .9245   | .7169 |
| merge_asof (time,total) | .7735  | .6981   | .6226 |
| itertuples (time,total) | .8301  | .8301   | .8301 |
| join ind (time)         | n/a    | n/a     | n/a   |
The join requires a shared key, right? The on clause in the documentation states, "Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple columns given, the passed DataFrame must have a MultiIndex."
I tried join with a MultiIndex of (time, total) and with just (time). The problem is that join clobbers whatever you join on: there's nothing left to perform the error analysis on, because those key columns are merged into a single index.
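One workaround for the clobbering (my own assumption, not something the docs spell out) is to copy the right-hand key columns under new names before building the MultiIndex, so they survive the join as ordinary columns for error analysis:

```python
import pandas as pd

# Hypothetical one-row frames; join only hits when the keys are exactly equal.
df1 = pd.DataFrame({'time': pd.to_datetime(['2017-02-19 08:03:00']),
                    'total': [15.0]})
df2 = pd.DataFrame({'time': pd.to_datetime(['2017-02-19 08:03:00']),
                    'total': [15.0]})

# Copy the right-hand keys under new names so they survive the join
# as ordinary columns, then move the originals into a MultiIndex.
right = df2.assign(time2=df2['time'], total2=df2['total'])
right = right.set_index(['time', 'total'])

joined = df1.join(right, on=['time', 'total'])
print(joined)
```

This still only works for exact key equality, so it doesn't solve the fuzzy-time problem, but it does leave time2/total2 intact for computing the error column.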
My naive itertuples solution (above) produced only perfect matches, but the solution still needs a collector for missed matches.
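A sketch of such a collector: an anti-join via merge(..., indicator=True) recovers the df1 rows that never landed in df3 (toy data reconstructed from the example above, with one extra unmatched row added for illustration):

```python
import pandas as pd

# Toy frames reconstructed from the example above (hypothetical).
df1 = pd.DataFrame([(1, 10), (5, 15), (9, 20)], columns=['A', 'X'])
df3 = pd.DataFrame([(1, 10, 2, 10), (5, 15, 6, 15)],
                   columns=['A', 'X', 'B', 'Y'])   # matched pairs

# Anti-join: flag df1 rows by whether they appear in df3, keep the misses.
check = df1.merge(df3[['A', 'X']], on=['A', 'X'], how='left', indicator=True)
missed = check.loc[check['_merge'] == 'left_only', ['A', 'X']]
print(missed)
```

Here the (9, 20) row comes back as a miss, since it never made it into the matched frame.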