Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
290 views
in Technique[技术] by (71.8m points)

classification - Understanding Spark RandomForest featureImportances results

I'm using RandomForest.featureImportances but I don't understand the output result.

I have 12 features, and this is the output I get.

I get that this might not be an apache-spark specific question but I cannot find anywhere that explains the output.

// org.apache.spark.mllib.linalg.Vector = (12,[0,1,2,3,4,5,6,7,8,9,10,11],
 [0.1956128039688559,0.06863606797951556,0.11302128590305296,0.091986700351889,0.03430651625283274,0.05975817050022879,0.06929766152519388,0.052654922125615934,0.06437052114945474,0.1601713590349946,0.0324327322375338,0.057751258970832206])
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Given a tree ensemble model, RandomForest.featureImportances computes the importance of each feature.

This generalizes the idea of "Gini" importance to other losses, following the explanation of Gini importance from "Random Forests" documentation by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.

For collections of trees, which includes boosting and bagging, Hastie et al. suggests to use the average of single tree importances across all trees in the ensemble.

And this feature importance is calculated as followed :

  • Average over trees:
    • importance(feature j) = sum (over nodes which split on feature j) of the gain, where gain is scaled by the number of instances passing through node
    • Normalize importances for tree to sum to 1.
  • Normalize feature importance vector to sum to 1.

References: Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001. - 15.3.2 Variable Importance page 593.

Let's go back to your importance vector :

val importanceVector = Vectors.sparse(12,Array(0,1,2,3,4,5,6,7,8,9,10,11), Array(0.1956128039688559,0.06863606797951556,0.11302128590305296,0.091986700351889,0.03430651625283274,0.05975817050022879,0.06929766152519388,0.052654922125615934,0.06437052114945474,0.1601713590349946,0.0324327322375338,0.057751258970832206))

First, let's sort this features by importance :

importanceVector.toArray.zipWithIndex
            .map(_.swap)
            .sortBy(-_._2)
            .foreach(x => println(x._1 + " -> " + x._2))
// 0 -> 0.1956128039688559
// 9 -> 0.1601713590349946
// 2 -> 0.11302128590305296
// 3 -> 0.091986700351889
// 6 -> 0.06929766152519388
// 1 -> 0.06863606797951556
// 8 -> 0.06437052114945474
// 5 -> 0.05975817050022879
// 11 -> 0.057751258970832206
// 7 -> 0.052654922125615934
// 4 -> 0.03430651625283274
// 10 -> 0.0324327322375338

So what does this mean ?

It means that your first feature (index 0) is the most important feature with a weight of ~ 0.19 and your 11th (index 10) feature is the least important in your model.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...