Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
469 views
in Technique[技术] by (71.8m points)

python - How to drop columns which have same values in all rows via pandas or spark dataframe?

Suppose I've data similar to following:

  index id   name  value  value2  value3  data1  val5
    0  345  name1    1      99      23     3      66
    1   12  name2    1      99      23     2      66
    5    2  name6    1      99      23     7      66

How can we drop all those columns like (value, value2, value3) where all rows have the same values, in one command or couple of commands using python?

Consider we have many columns similar to value, value2, value3...value200.

Output:

   index    id  name   data1
       0   345  name1    3
       1    12  name2    2
       5     2  name6    7
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

What we can do is use nunique to calculate the number of unique values in each column of the dataframe, and drop the columns which only have a single unique value:

In [285]:
nunique = df.nunique()
cols_to_drop = nunique[nunique == 1].index
df.drop(cols_to_drop, axis=1)

Out[285]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Another way is to just diff the numeric columns, take abs values and sums them:

In [298]:
cols = df.select_dtypes([np.number]).columns
diff = df[cols].diff().abs().sum()
df.drop(diff[diff== 0].index, axis=1)
?
Out[298]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Another approach is to use the property that the standard deviation will be zero for a column with the same value:

In [300]:
cols = df.select_dtypes([np.number]).columns
std = df[cols].std()
cols_to_drop = std[std==0].index
df.drop(cols_to_drop, axis=1)

Out[300]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Actually the above can be done in a one-liner:

In [306]:
df.drop(df.std()[(df.std() == 0)].index, axis=1)

Out[306]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...