Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
595 views
in Technique[技术] by (71.8m points)

scikit learn - Understanding max_features parameter in RandomForestRegressor

While constructing each tree in the random forest using bootstrapped samples, for each terminal node, we select m variables at random from p variables to find the best split (p is the total number of features in your data). My questions (for RandomForestRegressor) are:

1) What does max_features correspond to (m or p or something else)?

2) Are m variables selected at random from max_features variables (what is the value of m)?

3) If max_features corresponds to m, then why would I want to set it equal to p for regression (the default)? Where is the randomness with this setting (i.e., how is it different from bagging)?

Thanks.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Straight from the documentation:

[max_features] is the size of the random subsets of features to consider when splitting a node.

So max_features is what you call m. When max_features="auto", m = p and no feature subset selection is performed in the trees, so the "random forest" is actually a bagged ensemble of ordinary regression trees. The docs go on to say that

Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks

By setting max_features differently, you'll get a "true" random forest.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...