Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
483 views
in Technique[技术] by (71.8m points)

python - Fastest way to perform complex search on pandas dataframe

I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.

Before:

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       DEF       123     456     8000    8000
   DEF       XYZ       456     893     9999    9999
   AAA       BBB       473     917     5555    5555
   BBB       CCC       917     341     5555    5555

After search/sort:

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       XYZ       123     893     8000    9999
   AAA       CCC       473     341     5555    5555

In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.

possible edge case:

What if the data is switched up and our end connections are in the same column?

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       DEF       123     456     8000    8000
   XYZ       DEF       893     456     9999    9999

After search/sort:

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       XYZ       123     893     8000    9999

This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC

import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other , 
# in a net work graph , they are linked 
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict , 
# since every components connected to each other , 
# then we just need to pick of of them as key , then map with others

d={k: v for d in L for k, v in d.items()}

# create the dict for groupby , since we need _from as first item and _to as last item 
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output 

Out[22]: 
         flightTo flightFrom  toNum  fromNum  toCode  fromCode
flightTo                                                      
0             ABC        XYZ    123      893    8000      9999
1             AAA        CCC    473      341    5555      5555

Installation networkx

  • Pip: pip install networkx
  • Anaconda: conda install -c anaconda networkx

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

56.9k users

...