python - How to use sklearn Column Transformer?

Question

Welcome To Ask or Share your Answers For Others

python - How to use sklearn Column Transformer?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - How to use sklearn Column Transformer?

I'm trying to convert categorical value (in my case it is country column) into encoded value using LabelEncoder and then with OneHotEncoder and was able to convert the categorical value. But i'm getting warning like OneHotEncoder 'categorical_features' keyword is deprecated "use the ColumnTransformer instead." So how i can use ColumnTransformer to achieve same result ?

Below is my input data set and the code which i tried

Input Data set

Country Age Salary
France  44  72000
Spain   27  48000
Germany 30  54000
Spain   38  61000
Germany 40  67000
France  35  58000
Spain   26  52000
France  48  79000
Germany 50  83000
France  37  67000


import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#X is my dataset variable name

label_encoder = LabelEncoder()
x.iloc[:,0] = label_encoder.fit_transform(x.iloc[:,0]) #LabelEncoder is used to encode the country value
hot_encoder = OneHotEncoder(categorical_features = [0])
x = hot_encoder.fit_transform(x).toarray()

And the output i'm getting as, How can i get the same output with column transformer

0(fran) 1(ger) 2(spain) 3(age)  4(salary)
1         0       0      44        72000
0         0       1      27        48000
0         1       0      30        54000
0         0       1      38        61000
0         1       0      40        67000
1         0       0      35        58000
0         0       1      36        52000
1         0       0      48        79000
0         1       0      50        83000
1         0       0      37        67000

i tried following code

from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(

    ( [0], OneHotEncoder())
)
x = preprocess.fit_transform(x).toarray()

i was able to encode country column with the above code, but missing age and salary column from x varible after transforming

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:29:20+0000

It is a bit strange to encode continuous data as Salary. It makes no sense unless you have binned your salary to certain ranges/categories. If I where you I would do:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder



numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

from here you can pipe it with a classifier e.g.

clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', LogisticRegression(solver='lbfgs'))])

Use it as so:

clf.fit(X_train,y_train)

this will apply the preprocessor and then pass transfomed data to the predictor.

Updates:

If we want to select data types on fly, we can modify our preprocessor to use column selector by data dtypes:

from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include="numeric")),
        ('cat', categorical_transformer, selector(dtype_include="category"))])

Using GridSearch


param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'Classifier__solver': ['lbfgs', 'sag'],
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train,y_train)

Categories

python - How to use sklearn Column Transformer?

python - How to use sklearn Column Transformer?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Updates:

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags