
Feature Selection Using Feature Importance

Source: https://www.biaodianfu.com/feature-importance.html

In the previous article on feature selection for machine learning, we noted that tree models such as GBDT can also serve as base models for feature selection. This post builds on that idea and covers the two most widely used tree-based libraries beyond plain decision trees: XGBoost and LightGBM.
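As a quick refresher on the idea, here is a minimal sketch of importance-based selection with scikit-learn's SelectFromModel, using GBDT as the base model. The breast-cancer dataset and the median threshold are illustrative assumptions, not choices from the original article:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Illustrative dataset; any (X, y) pair works the same way.
X, y = load_breast_cancer(return_X_y=True)

# Keep only the features whose importance exceeds the median importance.
selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # e.g. (569, 30) -> (569, 15)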

DecisionTree

A decision tree's feature_importances_ attribute reports each feature's importance as the (normalized) total gain, i.e. the total reduction of the splitting criterion, contributed by every split on that feature in the tree.

The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

For background on information gain, see the earlier introductory article on decision trees.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.feature_importances_

GradientBoosting and ExtraTrees expose feature_importances_ in the same way as DecisionTree.
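A minimal sketch of reading feature_importances_ from a fitted tree; the iris dataset is just a stand-in:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Gini importance: normalized total criterion reduction per feature; sums to 1.
for name, imp in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")

The same attribute is available on GradientBoostingClassifier and ExtraTreesClassifier after fitting.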

XGBoost

get_score(fmap='', importance_type='weight')
Get feature importance of each feature. Importance type can be defined as:
‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.

In plain terms (a runnable sketch follows the reference link below):

  • weight: the number of times the feature is chosen as a split feature.
  • gain: the average gain per split on the feature, i.e. the summed gain across all trees divided by the number of splits that use it: gain = total_gain / weight.
  • cover: the average number of samples covered by splits on the feature.
  • total_gain: the total gain brought by the feature over all of its splits in all trees.
  • total_cover: the total number of samples covered (processed) by the feature's splits over all trees.

Reference: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_score
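A sketch of querying all five importance types from a trained Booster and checking the gain = total_gain / weight relationship; the dataset and training parameters are illustrative assumptions:

import xgboost as xgb
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
dtrain = xgb.DMatrix(data.data, label=data.target,
                     feature_names=list(data.feature_names))
booster = xgb.train({"objective": "binary:logistic"}, dtrain,
                    num_boost_round=50)

# get_score returns {feature_name: score}, omitting features never used.
for imp_type in ("weight", "gain", "cover", "total_gain", "total_cover"):
    scores = booster.get_score(importance_type=imp_type)
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(imp_type, top)

# Verify: gain is total_gain averaged over the number of splits (weight).
w = booster.get_score(importance_type="weight")
g = booster.get_score(importance_type="gain")
tg = booster.get_score(importance_type="total_gain")
feat = next(iter(w))
assert abs(g[feat] - tg[feat] / w[feat]) < 1e-6 * max(1.0, abs(g[feat]))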

LightGBM

feature_importance(importance_type='split', iteration=None)
Get feature importances.
  • importance_type (string, optional (default="split")) – How the importance is calculated. If "split", result contains numbers of times the feature is used in a model. If "gain", result contains total gains of splits which use the feature.
  • iteration (int or None, optional (default=None)) – Limit number of iterations in the feature importance calculation. If None, if the best iteration exists, it is used; otherwise, all trees are used. If <= 0, all trees are used (no limits).

In plain terms (see the sketch after the reference link below):

  • split is the total number of times the feature is used to split, across all trees in the model.
  • gain is the total gain of all splits that use the feature, summed across all trees.

Reference: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance
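A sketch of pulling both importance types from a trained LightGBM Booster; the dataset and parameters are again illustrative (feature names are sanitized because LightGBM is strict about special characters in names):

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
train_set = lgb.Dataset(data.data, label=data.target,
                        feature_name=[n.replace(" ", "_")
                                      for n in data.feature_names])
booster = lgb.train({"objective": "binary", "verbosity": -1},
                    train_set, num_boost_round=50)

split_imp = booster.feature_importance(importance_type="split")  # split counts
gain_imp = booster.feature_importance(importance_type="gain")    # summed gains

# Show the five features with the largest total gain.
for name, s, g in sorted(zip(booster.feature_name(), split_imp, gain_imp),
                         key=lambda t: t[2], reverse=True)[:5]:
    print(f"{name}: split={s}, gain={g:.1f}")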

