python 3.x - How to remove duplicate timestamp overlaps and add column in the original dataframe from zipped list findings

Question

Welcome To Ask or Share your Answers For Others

python 3.x - How to remove duplicate timestamp overlaps and add column in the original dataframe from zipped list findings

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

python 3.x - How to remove duplicate timestamp overlaps and add column in the original dataframe from zipped list findings

I have two columns in my dataframe 'START_TIME' and 'END_TIME' which i zipped into a list and brought it to the below form. Used the following snippet to generate that.

zippedList = list(zip(new_df['START_TIME'],new_df['END_TIME']))

[(Timestamp('2020-06-09 06:00:00'), Timestamp('2020-06-09 16:00:00')),
 (Timestamp('2020-06-09 02:00:00'), Timestamp('2020-06-09 06:00:00')),
 (Timestamp('2020-06-10 02:00:00'), Timestamp('2020-06-10 06:00:00')),
 (Timestamp('2020-06-09 16:00:00'), Timestamp('2020-06-10 02:00:00')),
 (Timestamp('2020-06-10 06:00:00'), Timestamp('2020-06-10 16:00:00')),
 (Timestamp('2020-06-10 16:00:00'), Timestamp('2020-06-11 02:00:00')),
 (Timestamp('2020-06-11 02:00:00'), Timestamp('2020-06-11 06:00:00')),
 (Timestamp('2020-06-11 01:00:00'), Timestamp('2020-06-11 05:00:00')),
 (Timestamp('2020-06-11 06:00:00'), Timestamp('2020-06-11 16:00:00')),
 (Timestamp('2020-06-11 16:00:00'), Timestamp('2020-06-12 02:00:00'))]

I went on to iterate through this list and find overlapping values too through this one

for elem1 in zippedList:
    for elem2 in zippedList:
        #print(elem1,elem2)
        i1= pd.Interval(elem1[0],elem1[1],closed='neither')
        i2= pd.Interval(elem2[0],elem2[1],closed='neither')
        if (i1.overlaps(i2)) and elem1!=elem2:
            print('OVERLAP FOUND!!')
            print(i1,i2)

Got these duplicated overlaps.

OVERLAP FOUND!!
(2020-06-10 16:00:00, 2020-06-11 02:00:00) (2020-06-11 01:00:00, 2020-06-11 05:00:00)
OVERLAP FOUND!!
(2020-06-11 02:00:00, 2020-06-11 06:00:00) (2020-06-11 01:00:00, 2020-06-11 05:00:00)
OVERLAP FOUND!!
(2020-06-11 01:00:00, 2020-06-11 05:00:00) (2020-06-10 16:00:00, 2020-06-11 02:00:00)
OVERLAP FOUND!!
(2020-06-11 01:00:00, 2020-06-11 05:00:00) (2020-06-11 02:00:00, 2020-06-11 06:00:00)

I have a couple of questions here.

How do I avoid these duplicate overlaps found. For instance, (2020-06-11 02:00:00, 2020-06-11 06:00:00) (2020-06-11 01:00:00, 2020-06-11 05:00:00) and (2020-06-11 01:00:00, 2020-06-11 05:00:00) (2020-06-11 02:00:00, 2020-06-11 06:00:00) are the same!
How do I create a boolean column in the original dataframe (new_df) to mark True for all timestamp pairs which have been found to have overlapped. For instance, along the rows Timestamp('2020-06-11 02:00:00'), Timestamp('2020-06-11 06:00:00')) and (Timestamp('2020-06-11 01:00:00'), Timestamp('2020-06-11 05:00:00')), should be marked as True. How do i achieve this? Note that the overlap findings are performed with the zipped list(zippedList) and not with my dataframe(new_df).

Thanks in advance!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:23:31+0000

Since you're just interested if there is an overlap (and not how many), you can break the inner for loop as soon as you find an overlap. Create the boolean mask as you check for overlaps in the nested for loops.

import pandas as pd

zippedList = (
    [(pd.Timestamp('2020-06-09 06:00:00'), pd.Timestamp('2020-06-09 16:00:00')),
     (pd.Timestamp('2020-06-09 02:00:00'), pd.Timestamp('2020-06-09 06:00:00')),
     (pd.Timestamp('2020-06-10 02:00:00'), pd.Timestamp('2020-06-10 06:00:00')),
     (pd.Timestamp('2020-06-09 16:00:00'), pd.Timestamp('2020-06-10 02:00:00')),
     (pd.Timestamp('2020-06-10 06:00:00'), pd.Timestamp('2020-06-10 16:00:00')),
     (pd.Timestamp('2020-06-10 16:00:00'), pd.Timestamp('2020-06-11 02:00:00')),
     (pd.Timestamp('2020-06-11 02:00:00'), pd.Timestamp('2020-06-11 06:00:00')),
     (pd.Timestamp('2020-06-11 01:00:00'), pd.Timestamp('2020-06-11 05:00:00')),
     (pd.Timestamp('2020-06-11 06:00:00'), pd.Timestamp('2020-06-11 16:00:00')),
     (pd.Timestamp('2020-06-11 16:00:00'), pd.Timestamp('2020-06-12 02:00:00'))]
    )

# map to intervals before looping
intervals = list(map(lambda i: pd.Interval(i[0], i[1], closed='neither'), zippedList))

m = []
for i1 in intervals:
    for i2 in intervals:
        if (i1.overlaps(i2)) and i1 != i2:
            m.append(True)
            break
    else: # else clause will only be called if break wasn't executed
        m.append(False)

for b, t in zip(m, zippedList):
    if b:
        print(t)

# (Timestamp('2020-06-10 16:00:00'), Timestamp('2020-06-11 02:00:00'))
# (Timestamp('2020-06-11 02:00:00'), Timestamp('2020-06-11 06:00:00'))
# (Timestamp('2020-06-11 01:00:00'), Timestamp('2020-06-11 05:00:00'))

Now you can make m a column of your df, i.e. new_df['OVERLAPS'] = m

Categories

python 3.x - How to remove duplicate timestamp overlaps and add column in the original dataframe from zipped list findings

python 3.x - How to remove duplicate timestamp overlaps and add column in the original dataframe from zipped list findings

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags