In many machine learning models, feature importance (or variable importance) is a key output, as it tells us the relative or absolute importance of each feature's contribution to the model. More specifically, in the case of classification, feature importance tells us which features are highly differentiating between the classes and which are not.
However, one shortfall of feature importance is that it's global in nature: it informs the data scientist about the overall strength of each feature across the whole dataset. What if something more granular and refined is required? Just because a feature is high up on the feature importance list does not mean it's important for every individual prediction.
Logistic regression as an example
Suppose you have some logistic regression model in the form
logit(p) = β0 + β1x1 + … + βpxp
For a given test case with p features, you would get the predicted probability by substituting each actualized feature value into the model (and applying the inverse logit transformation). Naturally, different test cases would have different actualized feature values, leading to different predicted probabilities.
Therefore, it’s intuitive to think about how a feature contributes to the predicted probability of a given test case:
feature contribution of xk = |βkxk| / (|β0| + Σi=1..p |βixi|) ∈ [0, 1]
With this, we can say that xk contributes to the predicted probability of a given test case by some proportion or percentage, i.e. its feature contribution.
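As a minimal sketch of this computation (the toy dataset and the helper feature_contributions are illustrative assumptions, using scikit-learn's LogisticRegression):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data and model, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

def feature_contributions(model, x):
    """Per-feature contribution |beta_k * x_k| normalised by
    |beta_0| + sum_i |beta_i * x_i| for a single test case x."""
    beta0 = model.intercept_[0]
    betas = model.coef_[0]
    terms = np.abs(betas * x)
    denom = np.abs(beta0) + terms.sum()
    return terms / denom

x = X[0]
contrib = feature_contributions(model, x)
# Each entry lies in [0, 1]; the total is below 1 because the
# intercept absorbs the remaining share of the denominator.
print(contrib, contrib.sum())
```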
Generalizing to tree-based models
This way of evaluating individual feature contributions generalizes beyond linear models. For example, in a decision tree, a given test case follows a specific prediction path down the tree, passing through multiple internal nodes. The change in predicted probability (or predicted value, for regression) at each split can be tracked and attributed to the feature used at that split, leading to a similar notion of feature contribution.
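Here is a rough sketch of that idea for a scikit-learn decision tree (again, the toy data and the helper path_contributions are my own illustrative assumptions): walk the decision path of one test case and attribute the change in class-1 probability at each split to the feature that split used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def path_contributions(tree, x):
    """For one test case, return {feature_index: total change in P(class 1)}
    accumulated along its root-to-leaf path."""
    t = tree.tree_
    # Node ids visited by this test case, from root to leaf
    node_ids = tree.decision_path(x.reshape(1, -1)).indices
    # Class-1 probability at each visited node (value holds per-class counts)
    probs = [t.value[n][0][1] / t.value[n][0].sum() for n in node_ids]
    contrib = {}
    for parent, parent_prob, child_prob in zip(node_ids[:-1], probs[:-1], probs[1:]):
        f = t.feature[parent]  # feature used to split at the parent node
        contrib[f] = contrib.get(f, 0.0) + (child_prob - parent_prob)
    return contrib

print(path_contributions(tree, X[0]))
```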
Moreover, this can be further generalized to ensemble models, such as Random Forest, ExtraTrees and even XGBoost.
I leave you with these two blog posts and the treeinterpreter package in Python, where this idea has already been explored (a minimal usage sketch follows the links):
- http://blog.datadive.net/interpreting-random-forests/
- https://blog.datadive.net/random-forest-interpretation-conditional-feature-contributions/
- https://github.com/andosa/treeinterpreter
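For completeness, a basic treeinterpreter call on a random forest looks roughly like this (a sketch based on the package's README; check the repo for the current API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Decompose predictions: prediction = bias + sum of per-feature contributions
prediction, bias, contributions = ti.predict(rf, X[:3])
print(np.allclose(prediction, bias + contributions.sum(axis=1)))  # True
```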
That’s all for this short post; I hope it helps you think a little more deeply about feature importance.