Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
220 views
in Technique[技术] by (71.8m points)

python 3.x - How to remove duplicate timestamp overlaps and add column in the original dataframe from zipped list findings

I have two columns in my dataframe 'START_TIME' and 'END_TIME' which i zipped into a list and brought it to the below form. Used the following snippet to generate that.

zippedList = list(zip(new_df['START_TIME'],new_df['END_TIME']))
[(Timestamp('2020-06-09 06:00:00'), Timestamp('2020-06-09 16:00:00')),
 (Timestamp('2020-06-09 02:00:00'), Timestamp('2020-06-09 06:00:00')),
 (Timestamp('2020-06-10 02:00:00'), Timestamp('2020-06-10 06:00:00')),
 (Timestamp('2020-06-09 16:00:00'), Timestamp('2020-06-10 02:00:00')),
 (Timestamp('2020-06-10 06:00:00'), Timestamp('2020-06-10 16:00:00')),
 (Timestamp('2020-06-10 16:00:00'), Timestamp('2020-06-11 02:00:00')),
 (Timestamp('2020-06-11 02:00:00'), Timestamp('2020-06-11 06:00:00')),
 (Timestamp('2020-06-11 01:00:00'), Timestamp('2020-06-11 05:00:00')),
 (Timestamp('2020-06-11 06:00:00'), Timestamp('2020-06-11 16:00:00')),
 (Timestamp('2020-06-11 16:00:00'), Timestamp('2020-06-12 02:00:00'))]

I went on to iterate through this list and find overlapping values too through this one

for elem1 in zippedList:
    for elem2 in zippedList:
        #print(elem1,elem2)
        i1= pd.Interval(elem1[0],elem1[1],closed='neither')
        i2= pd.Interval(elem2[0],elem2[1],closed='neither')
        if (i1.overlaps(i2)) and elem1!=elem2:
            print('OVERLAP FOUND!!')
            print(i1,i2)          

Got these duplicated overlaps.

OVERLAP FOUND!!
(2020-06-10 16:00:00, 2020-06-11 02:00:00) (2020-06-11 01:00:00, 2020-06-11 05:00:00)
OVERLAP FOUND!!
(2020-06-11 02:00:00, 2020-06-11 06:00:00) (2020-06-11 01:00:00, 2020-06-11 05:00:00)
OVERLAP FOUND!!
(2020-06-11 01:00:00, 2020-06-11 05:00:00) (2020-06-10 16:00:00, 2020-06-11 02:00:00)
OVERLAP FOUND!!
(2020-06-11 01:00:00, 2020-06-11 05:00:00) (2020-06-11 02:00:00, 2020-06-11 06:00:00)

I have a couple of questions here.

  1. How do I avoid these duplicate overlaps found. For instance, (2020-06-11 02:00:00, 2020-06-11 06:00:00) (2020-06-11 01:00:00, 2020-06-11 05:00:00) and (2020-06-11 01:00:00, 2020-06-11 05:00:00) (2020-06-11 02:00:00, 2020-06-11 06:00:00) are the same!

  2. How do I create a boolean column in the original dataframe (new_df) to mark True for all timestamp pairs which have been found to have overlapped. For instance, along the rows Timestamp('2020-06-11 02:00:00'), Timestamp('2020-06-11 06:00:00')) and (Timestamp('2020-06-11 01:00:00'), Timestamp('2020-06-11 05:00:00')), should be marked as True. How do i achieve this? Note that the overlap findings are performed with the zipped list(zippedList) and not with my dataframe(new_df).

Thanks in advance!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Since you're just interested if there is an overlap (and not how many), you can break the inner for loop as soon as you find an overlap. Create the boolean mask as you check for overlaps in the nested for loops.

import pandas as pd

zippedList = (
    [(pd.Timestamp('2020-06-09 06:00:00'), pd.Timestamp('2020-06-09 16:00:00')),
     (pd.Timestamp('2020-06-09 02:00:00'), pd.Timestamp('2020-06-09 06:00:00')),
     (pd.Timestamp('2020-06-10 02:00:00'), pd.Timestamp('2020-06-10 06:00:00')),
     (pd.Timestamp('2020-06-09 16:00:00'), pd.Timestamp('2020-06-10 02:00:00')),
     (pd.Timestamp('2020-06-10 06:00:00'), pd.Timestamp('2020-06-10 16:00:00')),
     (pd.Timestamp('2020-06-10 16:00:00'), pd.Timestamp('2020-06-11 02:00:00')),
     (pd.Timestamp('2020-06-11 02:00:00'), pd.Timestamp('2020-06-11 06:00:00')),
     (pd.Timestamp('2020-06-11 01:00:00'), pd.Timestamp('2020-06-11 05:00:00')),
     (pd.Timestamp('2020-06-11 06:00:00'), pd.Timestamp('2020-06-11 16:00:00')),
     (pd.Timestamp('2020-06-11 16:00:00'), pd.Timestamp('2020-06-12 02:00:00'))]
    )

# map to intervals before looping
intervals = list(map(lambda i: pd.Interval(i[0], i[1], closed='neither'), zippedList))

m = []
for i1 in intervals:
    for i2 in intervals:
        if (i1.overlaps(i2)) and i1 != i2:
            m.append(True)
            break
    else: # else clause will only be called if break wasn't executed
        m.append(False)

for b, t in zip(m, zippedList):
    if b:
        print(t)

# (Timestamp('2020-06-10 16:00:00'), Timestamp('2020-06-11 02:00:00'))
# (Timestamp('2020-06-11 02:00:00'), Timestamp('2020-06-11 06:00:00'))
# (Timestamp('2020-06-11 01:00:00'), Timestamp('2020-06-11 05:00:00'))

Now you can make m a column of your df, i.e. new_df['OVERLAPS'] = m


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...