python - Pandas combining rows based on dates

Question

Welcome To Ask or Share your Answers For Others

python - Pandas combining rows based on dates

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Pandas combining rows based on dates

I have a dataframe of customers with records for shipments they received. Unfortunately, these can overlap. I'm trying to reduce rows so that I can see dates of consecutive use. Is there any way to do this besides a brute force iterrows implementation?

Here's a sample and what I'd like to do:

df = pd.DataFrame([['A','2011-02-07','2011-02-22',1],['A','2011-02-14','2011-03-10',2],['A','2011-03-07','2011-03-15',3],['A','2011-03-18','2011-03-25',4]], columns = ['Cust','startDate','endDate','shipNo'])
df

condensedDf = df.groupby(['Cust']).apply(reductionFunction)
condensedDF

the reductionFunction will group the first 3 records into one, because in each case the start date of the next is before the end date of the prior. I'm essentially turning multiple records that overlap into one record.

Thoughts on a good "pythonic" implementation? I could do a nasty while loop within each group, but I'd prefer not to...

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T01:13:37+0000

Fundamentally, I think this is a graph connectivity problem: a fast way of solving it will be some manner of graph connectivity algorithm. Pandas doesn't include such tools, but scipy does. You can use the compressed sparse graph (csgraph) submodule in scipy to solve your problem like this:

from scipy.sparse.csgraph import connected_components

# convert to datetime, so min() and max() work
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)

def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges
    start = data.startDate.values
    end = data.endDate.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'startDate': 'min',
                                            'endDate': 'max',
                                            'shipNo': 'first'})

df.groupby(['Cust']).apply(reductionFunction).reset_index('Cust')

If you want to do something different with shipNo from here, it should be pretty straightforward.

Note that the connected_components() function above is not brute force, but uses a fast algorithm to find the connections.

Categories

python - Pandas combining rows based on dates

python - Pandas combining rows based on dates

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags