
pandas - What's a more efficient way to merge rows from DataFrames row-by-row with conditions?

I'm joining two tables of data from two systems. A simple pandas merge between the two DataFrames won't honor more complex matching rules (unless I'm using it wrong or misunderstanding what merge actually does, which is very possible).

I've cobbled together a toy solution that unpacks the two DataFrames with itertuples, validates matches based on their values, and then repacks the results into one DataFrame:

df1:            df2:
   A   X           B   Y
0  1  10        0  2  10
1  5  15        1  4  15
                2  6  15


import pandas as pd

data1 = [[1, 10], [5, 15]]           # df1 rows from the example above
data2 = [[2, 10], [4, 15], [6, 15]]  # df2 rows from the example above

df1 = pd.DataFrame(data1, columns=['A', 'X'])
df2 = pd.DataFrame(data2, columns=['B', 'Y'])
df3 = pd.DataFrame(index=['A', 'X', 'B', 'Y'])  # matches collected as columns, transposed at the end
i = -1

for rowA in df1.itertuples(index=False):
    i += 1
    for rowB in df2.itertuples(index=False):
        A, X = rowA
        B, Y = rowB
        if (B > A) and (X == Y):
            df3[i] = list(rowA + rowB)

print(df3.transpose())


   A   X  B   Y
0  1  10  2  10
1  5  15  6  15

My naive approach is inefficient

The nested for loop is inefficient because I iterate over all of data2/df2 for every row of data1/df1. Once a row of df2 has produced a good match, it should be removed from consideration.
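For what it's worth, here is a minimal sketch of that idea on the toy df1/df2 above; the used2 set and the early break are my own additions, not part of any pandas API:

import pandas as pd

df1 = pd.DataFrame([[1, 10], [5, 15]], columns=['A', 'X'])
df2 = pd.DataFrame([[2, 10], [4, 15], [6, 15]], columns=['B', 'Y'])

matches = []
used2 = set()  # df2 index labels that have already been claimed

for rowA in df1.itertuples(index=False):
    for idxB, rowB in zip(df2.index, df2.itertuples(index=False)):
        if idxB in used2:
            continue  # this df2 row was already paired earlier
        if (rowB.B > rowA.A) and (rowA.X == rowB.Y):
            matches.append(list(rowA) + list(rowB))
            used2.add(idxB)
            break  # stop scanning df2 once this df1 row is matched

df3 = pd.DataFrame(matches, columns=['A', 'X', 'B', 'Y'])
print(df3)

This still scans df2 in the worst case, but it no longer revisits rows that have already been paired and it stops early on the first valid match.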

// UPDATE (showing the origin of my question)

The data I'm working with comes from two independent systems that do not share any keys or other serial IDs. Since I can't make an exact match on a key, I must rely on logical/arithmetic conditions and a process of elimination.

In the following example, a simple pandas.merge fails on line 3 because Time1 < Time2.

   Time1,               Total1 ... Time2,               Total2, error
1, 2017-02-19 08:03:00, 15.00  ... 2017-02-19 08:02:00,  15.00, 0
2, 2017-02-19 08:28:00, 33.00  ... 2017-02-19 08:27:00,  33.00, 0
3, 2017-02-19 08:40:00, 20.00  ... 2017-02-19 10:06:00,  20.00, 1
4, 2017-02-19 10:08:00, 20.00  ... 2017-02-19 10:16:00,  20.00, 1
[...]

What should happen is something like this:

   Time1,               Total1 ... Time2,               Total2, error
1, 2017-02-19 08:03:00, 15.00  ... 2017-02-19 08:02:00,  15.00, 0
2, 2017-02-19 08:28:00, 33.00  ... 2017-02-19 08:27:00,  33.00, 0
3, 2017-02-19 08:40:00, 20.00  ... NaN,                  NaN,   NaN
4, 2017-02-19 10:08:00, 20.00  ... 2017-02-19 10:06:00,  20.00, 0
[...]
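As a point of reference, here is a minimal sketch of how that rule might be expressed with pandas.merge_asof, using values taken from the example rows above (both frames must be sorted by their time columns; note merge_asof does not guarantee one-to-one matching, so a df2 row can in principle be reused):

import pandas as pd

df1 = pd.DataFrame({
    'Time1': pd.to_datetime(['2017-02-19 08:03:00', '2017-02-19 08:28:00',
                             '2017-02-19 08:40:00', '2017-02-19 10:08:00']),
    'Total1': [15.0, 33.0, 20.0, 20.0],
})
df2 = pd.DataFrame({
    'Time2': pd.to_datetime(['2017-02-19 08:02:00', '2017-02-19 08:27:00',
                             '2017-02-19 10:06:00']),
    'Total2': [15.0, 33.0, 20.0],
})

# direction='backward' only accepts Time2 <= Time1; the by-columns enforce Total1 == Total2
merged = pd.merge_asof(
    df1.sort_values('Time1'),
    df2.sort_values('Time2'),
    left_on='Time1', right_on='Time2',
    left_by='Total1', right_by='Total2',
    direction='backward',
)
print(merged)

On this sample the row with Time1 = 08:40 gets NaN for Time2/Total2, because the only Total2 == 20.00 row in df2 is later in time, which is the behaviour shown in the desired output above.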

// UPDATE2: I've worked through several permutations of the merge_asof() and join() approaches recommended in the answers. The inputs for each method were also sorted as directed by the docs. Assuming I've implemented each correctly, the table below shows the fraction of the 53 records in my test set that are true matches under the rule ((time1 >= time2) & (Total1 == Total2)) for each of three methods:

| type                        | 'date' | 'total' | both  |
|-----------------------------|--------|---------|-------|
| merge_asof, sort on (time)  | .7924  | .9245   | .7169 |
| merge_asof (time, total)    | .7735  | .6981   | .6226 |
| itertuples (time, total)    | .8301  | .8301   | .8301 |
| join on index (time)        | n/a    | n/a     | n/a   |

The join requires a shared key, right? The on clause in the documentation states, "Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple columns given, the passed DataFrame must have a MultiIndex."

I tried join with a MultiIndex of (time, total) and with just (time). The problem is that join clobbers whatever you join on: there's nothing left to perform the error analysis on, because those index levels are merged into one.

My naive itertuples solution (above) produced only perfect matches, but the solution still needs a collector for missed matches.
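A minimal sketch of such a collector, assuming the matching loop records the matched index labels of both frames in sets (used1/used2 are my own names, shown here with hard-coded contents for illustration):

import pandas as pd

df1 = pd.DataFrame([[1, 10], [5, 15], [7, 11]], columns=['A', 'X'])
df2 = pd.DataFrame([[2, 10], [4, 15], [6, 15]], columns=['B', 'Y'])

# suppose the matching loop filled these sets while pairing rows
used1 = {0, 1}   # df1 rows 0 and 1 found a partner
used2 = {0, 2}   # df2 rows 0 and 2 were claimed

unmatched_df1 = df1.loc[~df1.index.isin(used1)]  # df1 rows that never matched
unmatched_df2 = df2.loc[~df2.index.isin(used2)]  # df2 rows that were never claimed
print(unmatched_df1)
print(unmatched_df2)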


1 Reply


df3=df1.join(df2) does not do what you want?
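For reference, a plain index-on-index join of the toy frames from the question looks like this (no conditions are applied; rows are paired purely by index label):

import pandas as pd

df1 = pd.DataFrame([[1, 10], [5, 15]], columns=['A', 'X'])
df2 = pd.DataFrame([[2, 10], [4, 15], [6, 15]], columns=['B', 'Y'])

df3 = df1.join(df2)  # left join on the index by default
print(df3)
#    A   X  B   Y
# 0  1  10  2  10
# 1  5  15  4  15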

