Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
196 views
in Technique[技术] by (71.8m points)

How to efficiently search between two dataframes in python pandas?

i have two dataframes (in pandas)

df1:

logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10

df2:

id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B

i want new dataframe like new_df:

logged_at, item, value, id
2021-01-03 20:01:23, A, 4, 2
2021-01-03 20:01:24, A, 5, 2
2021-01-03 20:01:25, B, 4, 3
2021-01-03 20:01:26, B, 7, 3
2021-01-03 20:01:27, A, 10, 2

What I want is to attach the ID of df2 to the column of df1.

The condition is that the logged_at time of df1 exists between the start_time and the end_time of df2.

The number of data in df1 exceeds 900,000 and the number of data in df2 exceeds 100,000.

It takes too long to attach each row of df1.

Is there an efficient way?

question from:https://stackoverflow.com/questions/65625736/how-to-efficiently-search-between-two-dataframes-in-python-pandas

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

A simple merge does what you want with your sample data.

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)

new_df = df1.merge(df2.loc[:,["id","item"]], on="item")

output

           logged_at item  value  id
 2021-01-03 20:01:23    A      4   2
 2021-01-03 20:01:24    A      5   2
 2021-01-03 20:01:27    A     10   2
 2021-01-03 20:01:25    B      4   3
 2021-01-03 20:01:26    B      7   3

pandasql

Does what you specified, however your sample data in df2 looks wrong as it gives two rows for each row in df1

from pandasql import sqldf
import pandas as pd
import io

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df1["logged_at"] = pd.to_datetime(df1["logged_at"])

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
df2["start_time"] = pd.to_datetime(df2["start_time"])
df2["end_time"] = pd.to_datetime(df2["end_time"])

pysqldf = lambda q: sqldf(q, globals())
pysqldf("""
select df1.*, df2.*
from df1 
left join df2 on df1.logged_at >= df2.start_time and df1.logged_at <= df2.end_time""")

pandasql output

                 logged_at item  value  id                  start_time                    end_time item
 2021-01-03 20:01:23.000000    A      4   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:23.000000    A      4   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:24.000000    A      5   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:24.000000    A      5   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:25.000000    B      4   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:25.000000    B      4   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:26.000000    B      7   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:26.000000    B      7   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:27.000000    A     10   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:27.000000    A     10   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...