python - One hot encoding of string categorical features

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to perform a one hot encoding of a trivial dataset.

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

What's the best way to preprocess this data using Scikit-Learn?

On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But the one hot encoder doesn't support strings as features; it only discretizes integers.

So then you would use a LabelEncoder, which encodes the strings as integers. But then you have to apply a separate label encoder to each column and store every one of those encoders (along with the column each was fitted on). This feels extremely clunky.
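For reference, the per-column workaround described above might look like the following. This is only a minimal sketch: the `encoders` dictionary and the `encoded` array are illustrative names, not part of any Scikit-Learn API.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = np.array([['a', 'dog', 'red'],
                 ['b', 'cat', 'green']])

# One LabelEncoder per column, kept around so the same mapping
# can be reused (or inverted) later. `encoders` is a hypothetical name.
encoders = {}
encoded = np.empty(data.shape, dtype=int)
for col in range(data.shape[1]):
    le = LabelEncoder()
    encoded[:, col] = le.fit_transform(data[:, col])
    encoders[col] = le

print(encoded)
```

Every column needs its own fitted encoder, and the result is still integer codes rather than a one hot matrix, which is the bookkeeping the question is trying to avoid.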

So, what's the best way to do it in Scikit-Learn?

Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one hot encodings. However, it's limited in that you can't encode your training and test sets separately.
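To illustrate that limitation, here is a small sketch using made-up `train` and `test` frames (they are assumptions for the example, not data from the question). Encoding each frame on its own yields different column sets whenever the categories differ.

```python
import pandas as pd

# Hypothetical train/test split with different colors in each part.
train = pd.DataFrame({"color": ["red", "green"]})
test = pd.DataFrame({"color": ["blue", "red"]})

# get_dummies only sees the categories present in the frame it is given.
train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

print(list(train_d.columns))  # ['color_green', 'color_red']
print(list(test_d.columns))   # ['color_blue', 'color_red']
```

The two encoded frames do not share a column layout, so a model fitted on one cannot be applied directly to the other.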

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:44:43+0000

If you are on sklearn>0.20.dev0

In [11]: import numpy as np
    ...: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: # Two feature columns: one of strings, one of integers
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
    ...: 
Out[11]:
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])
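Since the question is specifically about encoding the training and test sets separately, here is a minimal sketch of that workflow with the same encoder. The `X_test` row and its unseen value `'blue'` are made up for illustration. Passing `handle_unknown="ignore"` is one possible choice: it encodes an unseen category as all zeros for that feature instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['a', 'dog', 'red'],
                    ['b', 'cat', 'green']], dtype=object)
# Hypothetical test row; 'blue' never appears in the training data.
X_test = np.array([['a', 'cat', 'blue']], dtype=object)

enc = OneHotEncoder(handle_unknown="ignore")
train_encoded = enc.fit_transform(X_train).toarray()  # learn categories here
test_encoded = enc.transform(X_test).toarray()        # reuse them here

print(enc.categories_)  # per-column categories, in sorted order
print(train_encoded)
print(test_encoded)
```

Because the fitted encoder stores the categories of every column, the test matrix has the same six columns as the training matrix, and no per-column bookkeeping is needed.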

If you are on sklearn==0.20.dev0

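In [30]: import numpy as np
    ...: from sklearn.preprocessing import CategoricalEncoder  # only present in the 0.20.dev0 development version
    ...: cat = CategoricalEncoder()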

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])

Another way to do it is to use category_encoders.

Here is an example:

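% pip install category_encoders
import numpy as np
import category_encoders as ce
# The keyword arguments below match the category_encoders release in use
# when this answer was written; other releases may name them differently.
le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)
# Output:
array([[1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1]])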
