Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
234 views
in Technique[技术] by (71.8m points)

python - Compare multiple pandas columns (1st and 2nd, after 3rd and 4rth, after etc) with vectorization (better) or other method

This code compares based on condition the var1 and var2 and creates Results1 based on choices (this code works well):

# from: https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas?answertab=oldest#tab-top
# from: https://stackoverflow.com/questions/60099141/negation-in-np-select-condition

import pandas as pd
import numpy as np

# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
    (df["var1"].notna()) & (df['var2'].notna()),
    (pd.isna(df["var1"])) & (pd.isna(df["var2"])),
    (df["var1"].notna()) & (pd.isna(df["var2"])),
    (pd.isna(df["var1"])) & (df['var2'].notna())]

choices = ["Both values", np.nan, df["var1"], df["var2"]]

df['Result1'] = np.select(conditions, choices, default=np.nan)

df looks like as it should:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         |

Now I want to compare multiple pandas columns (in my example var1 and var2, after var3 and var4, after var5 and var6) and based on condition and choices create corresponding Results column (in my example Result1, Result2, Result3). I thought the best way should be to use vectorization (because of better performance). The df I want to get should look like:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     | Result2     | Result3     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|:------------|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values | nan         | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values | Both values | nan         |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           | nan         | Both values |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           | Both values | nan         |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         | Both values | Both values |

I tried this:

import pandas as pd
import numpy as np

# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
    (df[col1].notna()) & (df[col2].notna()),
    (pd.isna(df[col1])) & (pd.isna(df[col2])),
    (df[col1].notna()) & (pd.isna(df[col2])),
    (pd.isna(df[col1])) & (df[col2].notna())]

choices = ["Both values", np.nan, df[col1], df[col2]]

df[colR] = np.select(conditions, choices, default=np.nan)

Buy it gave me error:

ValueError: shape mismatch: objects cannot be broadcast to a single shape

Question: How to achieve my goal with vectorization (preferable because of better performance) or other method?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The issue is that pandas DataFrames force alignment on the index, but df[col1] and df[col2] have no overlapping columns.

In this case, you really want to work with the underlying numpy arrays. Also because .isnull() is the opposite of notnull you can simplify this a lot. We'll concat to add the new columns back.

col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()

conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]

df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)

  var1  var2 var3  var4 var5  var6      Result1      Result2      Result3
0    a   1.0  NaN   NaN    a   1.0  Both values          NaN  Both values
1    b   2.0    x   4.0  NaN   NaN  Both values  Both values          NaN
2    c   NaN  NaN   NaN    b   2.0            c          NaN  Both values
3  NaN   4.0    y   5.0  NaN   NaN            4  Both values          NaN
4  NaN   NaN    z   6.0    c   3.0          NaN  Both values  Both values

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...