python - Compare multiple pandas columns (1st and 2nd, after 3rd and 4rth, after etc) with vectorization (better) or other method

Question

Welcome To Ask or Share your Answers For Others

python - Compare multiple pandas columns (1st and 2nd, after 3rd and 4rth, after etc) with vectorization (better) or other method

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Compare multiple pandas columns (1st and 2nd, after 3rd and 4rth, after etc) with vectorization (better) or other method

This code compares based on condition the var1 and var2 and creates Results1 based on choices (this code works well):

# from: https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas?answertab=oldest#tab-top
# from: https://stackoverflow.com/questions/60099141/negation-in-np-select-condition

import pandas as pd
import numpy as np

# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
    (df["var1"].notna()) & (df['var2'].notna()),
    (pd.isna(df["var1"])) & (pd.isna(df["var2"])),
    (df["var1"].notna()) & (pd.isna(df["var2"])),
    (pd.isna(df["var1"])) & (df['var2'].notna())]

choices = ["Both values", np.nan, df["var1"], df["var2"]]

df['Result1'] = np.select(conditions, choices, default=np.nan)

df looks like as it should:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         |

Now I want to compare multiple pandas columns (in my example var1 and var2, after var3 and var4, after var5 and var6) and based on condition and choices create corresponding Results column (in my example Result1, Result2, Result3). I thought the best way should be to use vectorization (because of better performance). The df I want to get should look like:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     | Result2     | Result3     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|:------------|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values | nan         | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values | Both values | nan         |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           | nan         | Both values |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           | Both values | nan         |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         | Both values | Both values |

I tried this:

import pandas as pd
import numpy as np

# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
    (df[col1].notna()) & (df[col2].notna()),
    (pd.isna(df[col1])) & (pd.isna(df[col2])),
    (df[col1].notna()) & (pd.isna(df[col2])),
    (pd.isna(df[col1])) & (df[col2].notna())]

choices = ["Both values", np.nan, df[col1], df[col2]]

df[colR] = np.select(conditions, choices, default=np.nan)

Buy it gave me error:

ValueError: shape mismatch: objects cannot be broadcast to a single shape

Question: How to achieve my goal with vectorization (preferable because of better performance) or other method?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:26:46+0000

The issue is that pandas DataFrames force alignment on the index, but df[col1] and df[col2] have no overlapping columns.

In this case, you really want to work with the underlying numpy arrays. Also because .isnull() is the opposite of notnull you can simplify this a lot. We'll concat to add the new columns back.

col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()

conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]

df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)

  var1  var2 var3  var4 var5  var6      Result1      Result2      Result3
0    a   1.0  NaN   NaN    a   1.0  Both values          NaN  Both values
1    b   2.0    x   4.0  NaN   NaN  Both values  Both values          NaN
2    c   NaN  NaN   NaN    b   2.0            c          NaN  Both values
3  NaN   4.0    y   5.0  NaN   NaN            4  Both values          NaN
4  NaN   NaN    z   6.0    c   3.0          NaN  Both values  Both values

Categories

python - Compare multiple pandas columns (1st and 2nd, after 3rd and 4rth, after etc) with vectorization (better) or other method

python - Compare multiple pandas columns (1st and 2nd, after 3rd and 4rth, after etc) with vectorization (better) or other method

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags