Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
154 views
in Technique[技术] by (71.8m points)

python - Using scikit to determine contributions of each feature to a specific class prediction

I am using a scikit extra trees classifier:

model = ExtraTreesClassifier(n_estimators=10000, n_jobs=-1, random_state=0)

Once the model is fitted and used to predict classes, I would like to find out the contributions of each feature to a specific class prediction. How do I do that in scikit learn? Is it possible with extra trees classifier or do I need to use some other model?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Update

Being more knowledgable about ML today than I was 2.5 years ago, I will now say this approach only works for highly linear decision problems. If you carelessly apply it to a non-linear problem you will have trouble.

Example: Imagine a feature for which neither very large nor very small values predict a class, but values in some intermediate interval do. That could be water intake to predict dehydration. But water intake probably interacts with salt intake, as eating more salt allows for a greater water intake. Now you have an interaction between two non-linear features. The decision boundary meanders around your feature-space to model this non-linearity and to ask only how much one of the features influences the risk of dehydration is simply ignorant. It is not the right question.

Alternative: Another, more meaningful, question you could ask is: If I didn't have this information (if I left out this feature) how much would my prediction of a given label suffer? To do this you simply leave out a feature, train a model and look at how much precision and recall drops for each of your classes. It still informs about feature importance, but it makes no assumptions about linearity.

Below is the old answer.


I worked through a similar problem a while back and posted the same question on Cross Validated. The short answer is that there is no implementation in sklearn that does all of what you want.

However, what you are trying to achieve is really quite simple, and can be done by multiplying the average standardised mean value of each feature split on each class, with the corresponding model._feature_importances array element. You can write a simple function that standardises your dataset, computes the mean of each feature split across class predictions, and does element-wise multiplication with the model._feature_importances array. The greater the absolute resulting values are, the more important the features will be to their predicted class, and better yet, the sign will tell you if it is small or large values that are important.

Here's a super simple implementation that takes a datamatrix X, a list of predictions Y and an array of feature importances, and outputs a JSON describing importance of each feature to each class.

def class_feature_importance(X, Y, feature_importances):
    N, M = X.shape
    X = scale(X)

    out = {}
    for c in set(Y):
        out[c] = dict(
            zip(range(N), np.mean(X[Y==c, :], axis=0)*feature_importances)
        )

    return out

Example:

import numpy as np
import json
from sklearn.preprocessing import scale

X = np.array([[ 2,  2,  2,  0,  3, -1],
              [ 2,  1,  2, -1,  2,  1],
              [ 0, -3,  0,  1, -2,  0],
              [-1, -1,  1,  1, -1, -1],
              [-1,  0,  0,  2, -3,  1],
              [ 2,  2,  2,  0,  3,  0]], dtype=float)

Y = np.array([0, 0, 1, 1, 1, 0])
feature_importances = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])
#feature_importances = model._feature_importances

result = class_feature_importance(X, Y, feature_importances)

print json.dumps(result,indent=4)

{
    "0": {
        "0": 0.097014250014533204, 
        "1": 0.16932975630904751, 
        "2": 0.27854300726557774, 
        "3": -0.17407765595569782, 
        "4": 0.0961523947640823, 
        "5": 0.0
    }, 
    "1": {
        "0": -0.097014250014533177, 
        "1": -0.16932975630904754, 
        "2": -0.27854300726557779, 
        "3": 0.17407765595569782, 
        "4": -0.0961523947640823, 
        "5": 0.0
    }
}

The first level of keys in result are class labels, and the second level of keys are column-indices, i.e. feature-indices. Recall that large absolute values corresponds to importance, and the sign tells you whether it's small (possibly negative) or large values that matter.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...