Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
624 views
in Technique[技术] by (71.8m points)

scikit learn - Issue with OneHotEncoder for categorical features

I want to encode 3 categorical features out of 10 features in my datasets. I use preprocessing from sklearn.preprocessing to do so as the following:

from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
enc.fit(dataset.values)

However, I couldn't proceed as I am getting this error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: PG

I am surprised why it is complaining about the string as it is supposed to convert it!! Am I missing something here?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you read the docs for OneHotEncoder you'll see the input for fit is "Input array of type int". So you need to do two steps for your one hot encoded data

from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)
print new_cat_features # [1 2 0]
new_cat_features = new_cat_features.reshape(-1, 1) # Needs to be the correct shape
ohe = preprocessing.OneHotEncoder(sparse=False) #Easier to read
print ohe.fit_transform(new_cat_features)

Output:

[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]

EDIT

As of 0.20 this became a bit easier, not only because OneHotEncoder now handles strings nicely, but also because we can transform multiple columns easily using ColumnTransformer, see below for an example

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

X = np.array([['apple', 'red', 1, 'round', 0],
              ['orange', 'orange', 2, 'round', 0.1],
              ['bannana', 'yellow', 2, 'long', 0],
              ['apple', 'green', 1, 'round', 0.2]])
ct = ColumnTransformer(
    [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3]),],  # the column numbers I want to apply this to
    remainder='passthrough'  # This leaves the rest of my columns in place
)
print(ct2.fit_transform(X)) # Notice the output is a string

Output:

[['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
 ['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
 ['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
 ['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...