0 votes
214 views
in Technique by (71.8m points)

python - One hot encoding train with values not present on test

I have a train set and a test set stored as DataFrames. I am trying to one-hot encode the nominal features in my dataset, but I have the following issues:

  1. In total there are 3 categorical features, but I don't know all the values each feature can take, because the dataset is large.
  2. The test set has values that are not present in the train set, so when I do one-hot encoding, those unseen values should be encoded as all-zero vectors. But as I mentioned in 1, I don't know all the values in advance.
  3. I found I can use df = pd.get_dummies(df, prefix_sep='_') to do the one-hot encoding. The command works on all categorical features, but I noticed it moves the new columns to the end of the train DataFrame, so we no longer know which columns belong to which feature. There is also issue 2: the encoded train and test sets should have the same columns (a pandas-only workaround is sketched after this question).

Is there any automated way to do this, or a library perhaps?
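For reference, here is a minimal sketch of the pandas-only idea from point 3 above, aligning the get_dummies output of the test set to the train set. The column names (color, size) are made up for illustration, since the real features are unknown:

import pandas as pd

# Hypothetical toy data; the real columns and values are unknown
train = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "M"]})
test = pd.DataFrame({"color": ["red", "green"], "size": ["S", "L"]})

# Encode each frame separately, then align the test columns to the train columns
train_enc = pd.get_dummies(train, prefix_sep='_')
test_enc = pd.get_dummies(test, prefix_sep='_')

# Test-only categories ("green", "L") are dropped from the test encoding,
# and train-only categories ("blue", "M") are added to it as all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)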

EDIT

Thanks to the answers below, I was able to perform one-hot encoding on many features. But the code below raised the following issues:

  1. I think scikit-learn strips the column headers and produces the result as an array, not as a DataFrame.
  2. Since the headers are stripped away, we have no way of knowing which vector belongs to which feature. Even if I run df_scaled = pd.DataFrame(ct.fit_transform(data2)) to store the results in a DataFrame, the created DataFrame df_scaled has no headers, especially since the headers change after the pre-processing. Perhaps sklearn.preprocessing.OneHotEncoder has a method which keeps track of the new features and their indices? (See the sketch below.)
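A minimal sketch of one way to recover the headers from a ColumnTransformer, assuming scikit-learn ≥ 1.0; the data and column names below are hypothetical stand-ins for the ct and data2 mentioned above:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for data2
data2 = pd.DataFrame({"color": ["red", "blue", "red"],
                      "size": ["S", "M", "L"],
                      "price": [10.0, 12.5, 9.0]})

# One-hot encode the (assumed) categorical columns, pass the rest through
ct = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown='ignore'), ["color", "size"])],
    remainder='passthrough',
)

# get_feature_names_out() gives the headers of the transformed matrix,
# e.g. 'ohe__color_blue', ..., 'remainder__price'
df_scaled = pd.DataFrame(ct.fit_transform(data2), columns=ct.get_feature_names_out())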

1 Reply

0 votes
by (71.8m points)

Instead of using pd.get_dummies, which has the drawbacks you identified, use sklearn.preprocessing.OneHotEncoder. It automatically collects all nominal categories from your train data and then encodes your test data according to the categories identified in the training step. If there are new categories in the test data, it will simply encode them as 0's.

Example:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

x_train = np.array([["A1","B1","C1"],["A2","B1","C2"]])
x_test = np.array([["A1","B2","C2"]]) # As you can see, "B2" is a new attribute for column B

ohe = OneHotEncoder(handle_unknown='ignore')  # 'ignore' tells the encoder to encode new categories as all 0's
ohe.fit(x_train)
print(ohe.transform(x_train).toarray())
>>> [[1. 0. 1. 1. 0.]
     [0. 1. 1. 0. 1.]]

To get a summary of the categories by column in the train set, do:

print(ohe.categories_)
>>> [array(['A1', 'A2'], dtype='<U2'), 
     array(['B1'], dtype='<U2'), 
     array(['C1', 'C2'], dtype='<U2')]

To map the one-hot encoded columns back to the original features, do:

print(ohe.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
>>> ['x0_A1' 'x0_A2' 'x1_B1' 'x2_C1' 'x2_C2']

Finally, this is how the encoder works on new test data:

print(ohe.transform(x_test).toarray())
>>> [[1. 0. 0. 0. 1.]] # 1 for A1, 0 for A2, 0 for B1, 0 for C1, 1 for C2

EDIT:

You seem to be worried about losing the labels after the encoding. It is actually very easy to get them back: just wrap the result in a DataFrame and take the column names from ohe.get_feature_names_out() (get_feature_names() in older scikit-learn versions):

pd.DataFrame(ohe.transform(x_test).toarray(), columns=ohe.get_feature_names_out())
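As a side note, recent scikit-learn versions can return a DataFrame directly; a minimal sketch, assuming scikit-learn ≥ 1.2 (pandas output requires a dense encoding, hence sparse_output=False):

# x_train and x_test as defined above
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')
ohe.fit(x_train)
ohe.transform(x_test)  # a DataFrame with columns x0_A1, x0_A2, x1_B1, x2_C1, x2_C2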
