I'm joining two tables of data from two different systems. A simple pandas merge between the two DataFrames won't honor my more complex matching rules (unless I'm using it wrong or misunderstanding what merge actually does, which is very possible). I've cobbled together a toy solution that unpacks the two DataFrames with itertuples, validates matches based on values, and repacks the result into one DataFrame:
df1:            df2:
   A   X           B   Y
0  1  10        0  2  10
1  5  15        1  4  15
                2  6  15
import pandas as pd

df1 = pd.DataFrame(data1, columns=['A', 'X'])
df2 = pd.DataFrame(data2, columns=['B', 'Y'])
df3 = pd.DataFrame(index=['A', 'X', 'B', 'Y'])
i = -1
for rowA in df1.itertuples(index=False):
    i += 1
    for rowB in df2.itertuples(index=False):
        A, X = rowA
        B, Y = rowB
        if (B > A) & (X == Y):
            df3[i] = list(rowA + rowB)
print(df3.transpose())
A X B Y
0 1 10 2 10
1 5 15 6 15
My naive approach is inefficient: the nested for loop iterates over all of data2/df2 for every row of data1/df1. Once a row of df2 produces a good match, it should be removed from further consideration.
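One way to cut the wasted scanning (a sketch under my own assumptions, not necessarily the fastest approach) is to keep df2's rows in a mutable pool, delete each row once it matches, break out of the inner loop, and collect any df1 rows that never match:

```python
import pandas as pd

# Hypothetical data matching the toy example above.
df1 = pd.DataFrame([(1, 10), (5, 15)], columns=['A', 'X'])
df2 = pd.DataFrame([(2, 10), (4, 15), (6, 15)], columns=['B', 'Y'])

matched, unmatched = [], []
candidates = list(df2.itertuples(index=False))  # mutable pool to eliminate from

for rowA in df1.itertuples(index=False):
    for j, rowB in enumerate(candidates):
        if rowB.B > rowA.A and rowB.Y == rowA.X:
            matched.append(rowA + rowB)
            del candidates[j]        # eliminate the matched row
            break                    # stop scanning once matched
    else:
        unmatched.append(rowA)       # collector for misses

df3 = pd.DataFrame(matched, columns=['A', 'X', 'B', 'Y'])
print(df3)
```

This reproduces the output above while shrinking the candidate pool as matches are found, and the `for`/`else` gives the missed-match collector for free.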
// UPDATE (showing the origin of my question)
An example of the kind of data I'm working with merges two independent systems which do not share any keys or other serialized IDs. Since I can't make an exact match, I must rely on logical/arithmetic comparisons and the process of elimination.
In the following example a simple pandas.merge fails on line 3, because Time1 < Time2.
Time1, Total1 ... Time2, Total2, error
1, 2017-02-19 08:03:00, 15.00 ... 2017-02-19 08:02:00, 15.00, 0
2, 2017-02-19 08:28:00, 33.00 ... 2017-02-19 08:27:00, 33.00, 0
3, 2017-02-19 08:40:00, 20.00 ... 2017-02-19 10:06:00, 20.00, 1
4, 2017-02-19 10:08:00, 20.00 ... 2017-02-19 10:16:00, 20.00, 1
[...]
What should happen is something like this:
Time1, Total1 ... Time2, Total2, error
1, 2017-02-19 08:03:00, 15.00 ... 2017-02-19 08:02:00, 15.00, 0
2, 2017-02-19 08:28:00, 33.00 ... 2017-02-19 08:27:00, 33.00, 0
3, 2017-02-19 08:40:00, 20.00 ... NaN, NaN, NaN
4, 2017-02-19 10:08:00, 20.00 ... 2017-02-19 10:06:00, 20.00, 0
[...]
// UPDATE2
I've worked on several permutations of merge_asof() and join() recommended in answers. Each method was also sorted as directed by the docs. Assuming I've implemented each correctly, the following percentages are True matches of the rule ((Time1 >= Time2) & (Total1 == Total2)) out of 53 records in my test set, using each of three methods:
| method                  | 'date' | 'total' | both  |
|-------------------------|--------|---------|-------|
| merge_asof sort (time)  | .7924  | .9245   | .7169 |
| merge_asof (time,total) | .7735  | .6981   | .6226 |
| itertuples (time,total) | .8301  | .8301   | .8301 |
| join ind (time)         | n/a    | n/a     | n/a   |
The join requires a shared key, right? The on clause in the documentation states, "Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple columns given, the passed DataFrame must have a MultiIndex."
I tried join with a MultiIndex of (time, total) and with just (time). The problem is that join clobbers whatever you join on: there's nothing left to perform the error analysis on, because those key columns are merged into a single index.
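One workaround for the clobbering (my own assumption, not something the docs spell out) is to copy the right-hand key columns under new names before building the MultiIndex, so they survive the join as ordinary columns for error analysis:

```python
import pandas as pd

# Hypothetical one-row frames; join only hits when the keys are exactly equal.
df1 = pd.DataFrame({'time': pd.to_datetime(['2017-02-19 08:03:00']),
                    'total': [15.0]})
df2 = pd.DataFrame({'time': pd.to_datetime(['2017-02-19 08:03:00']),
                    'total': [15.0]})

# Copy the right-hand keys under new names so they survive the join
# as ordinary columns, then move the originals into a MultiIndex.
right = df2.assign(time2=df2['time'], total2=df2['total'])
right = right.set_index(['time', 'total'])

joined = df1.join(right, on=['time', 'total'])
print(joined)
```

This still only works for exact key equality, so it doesn't solve the fuzzy-time problem, but it does leave time2/total2 intact for computing the error column.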
My naive itertuples solution (above) produced only perfect matches, but the solution still needs a collector for missed matches.
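A sketch of such a collector: an anti-join via merge(..., indicator=True) recovers the df1 rows that never landed in df3 (toy data reconstructed from the example above, with one extra unmatched row added for illustration):

```python
import pandas as pd

# Toy frames reconstructed from the example above (hypothetical).
df1 = pd.DataFrame([(1, 10), (5, 15), (9, 20)], columns=['A', 'X'])
df3 = pd.DataFrame([(1, 10, 2, 10), (5, 15, 6, 15)],
                   columns=['A', 'X', 'B', 'Y'])   # matched pairs

# Anti-join: flag df1 rows by whether they appear in df3, keep the misses.
check = df1.merge(df3[['A', 'X']], on=['A', 'X'], how='left', indicator=True)
missed = check.loc[check['_merge'] == 'left_only', ['A', 'X']]
print(missed)
```

Here the (9, 20) row comes back as a miss, since it never made it into the matched frame.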