An uncommon approach to tackling class imbalance

2019/05/11

In supervised learning, one challenge faced by data scientists is class imbalance, where in a binary classification problem, instances of one class severely outnumber instances of the other. This poses a problem because model performance may be misleading: a naive example would be to always predict negative on a dataset that is 10% positive and 90% negative. Accuracy would then be 90%, but the model would be utterly useless.
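
The arithmetic behind that naive example can be checked in a few lines. This is only a sketch: the 100-row toy dataset and the helper names `accuracy` and `recall` are made up for illustration.

```python
# Hypothetical toy dataset: 10 positives (1) and 90 negatives (0).
y_true = [1] * 10 + [0] * 90

# A "model" that always predicts the negative class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos


print(accuracy(y_true, y_pred))  # 0.9
print(recall(y_true, y_pred))    # 0.0
```

Accuracy looks respectable, while recall shows that not a single positive case was found.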

The typical approaches to alleviating class imbalance include using more robust metrics such as ROC-AUC, downsampling the majority class, or upsampling the minority class (e.g. SMOTE). Of course, downsampling the majority class is often frowned upon, as precious data points are discarded.
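
To make the two resampling ideas concrete, here is a minimal sketch of plain random downsampling and upsampling using only the standard library. Note that this is not SMOTE, which synthesizes new minority points by interpolation; the function names, the `majority` and `seed` parameters, and the toy data are all assumptions for illustration, and the code assumes a binary problem in which the minority class is non-empty and smaller than the majority class.

```python
import random


def _split_indices(labels, majority):
    """Return (minority_indices, majority_indices) for binary labels."""
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    return minority_idx, majority_idx


def random_downsample(rows, labels, majority=0, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, majority)
    kept = rng.sample(majority_idx, len(minority_idx))
    idx = sorted(minority_idx + kept)
    return [rows[i] for i in idx], [labels[i] for i in idx]


def random_upsample(rows, labels, majority=0, seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, majority)
    extra = rng.choices(minority_idx, k=len(majority_idx) - len(minority_idx))
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical data: 100 rows, 10% positive.
rows = list(range(100))
labels = [1] * 10 + [0] * 90

down_rows, down_labels = random_downsample(rows, labels)
up_rows, up_labels = random_upsample(rows, labels)
print(len(down_rows), sum(down_labels))  # 20 10
print(len(up_rows), sum(up_labels))      # 180 90
```

The downsampled set keeps only 20 of the 100 original rows, which illustrates why discarding majority data is often frowned upon.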

In many situations, these approaches work reasonably well. However, in contexts in which there is an inherent asymmetry between false positives and false negatives, these approaches are less than ideal. For example,

• In cyber security, the inability to detect an intrusion into networks (false negative) incurs a different cost as compared to a false alarm (false positive).
• In human resources, the inability to detect impending attrition of a high-potential employee (false negative) incurs a different cost as compared to incorrectly detecting said attrition (false positive).
• In a clinical setting, the inability to detect post-surgical complications (false negative) incurs a different cost as compared to incorrectly detecting a complication (false positive).

In most practical contexts, false positives and false negatives incur different waste and costs. In addition, given that prediction errors are going to occur, there is often a preferred type of error. For instance, a healthcare practitioner would likely rather not miss a post-surgical complication than save on manpower and resources at the cost of a poor prognosis.

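Assuming that is the case, a simple but less commonplace approach to tackling class imbalance is to design a utility function U(m) that captures the inherent asymmetry of prediction outcomes, and to use U(m) as the objective in the ML training process for model m (maximizing U(m), or equivalently using -U(m) as the loss function). Such utility functions are often used in econometrics to capture choices and preferences. An instructive but naive example assigns a utility score to each prediction outcome: +1 for each true positive (TP), +1 for each true negative (TN), -5 for each false positive (FP) and -50 for each false negative (FN).

U(m) would then be: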

U(m) = TP(n1) + TN(n2) - 5FP(n3) - 50FN(n4)

where n1, n2, n3 and n4 are the respective case counts of each prediction outcome. U(m), with its sign flipped so that lower is better, can then be used as the loss function in tuning individual ML models, heavily penalizing false negatives.
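
As a rough illustration, U(m) can be computed directly from a model's predictions. The weights below are the ones in the formula above; the function names, the 10% positive toy labels and the three hypothetical "models" are assumptions made up for this sketch.

```python
from collections import Counter

# Utility score per prediction outcome, taken from the U(m) example.
UTILITY = {"TP": 1, "TN": 1, "FP": -5, "FN": -50}


def outcome(y_true, y_pred):
    """Classify one (label, prediction) pair as TP, FN, FP or TN."""
    if y_true == 1:
        return "TP" if y_pred == 1 else "FN"
    return "FP" if y_pred == 1 else "TN"


def utility(y_true, y_pred, weights=UTILITY):
    """U(m): weighted sum of the case counts of each prediction outcome."""
    counts = Counter(outcome(t, p) for t, p in zip(y_true, y_pred))
    return sum(weights[k] * counts[k] for k in weights)


# Hypothetical labels: 10 positives followed by 90 negatives.
y_true = [1] * 10 + [0] * 90

always_negative = [0] * 100
always_positive = [1] * 100
# Catches 8 of 10 positives at the price of 10 false alarms.
imperfect = [1] * 8 + [0] * 2 + [1] * 10 + [0] * 80

print(utility(y_true, always_negative))  # 90 - 50*10 = -410
print(utility(y_true, always_positive))  # 10 - 5*90 = -440
print(utility(y_true, imperfect))        # 8 + 80 - 5*10 - 50*2 = -62
```

Under these weights the always-negative model, which scores 90% on accuracy, is only marginally better than flagging every case, while the imperfect model that misses just two positives scores far higher.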

The utility score of each prediction outcome is a function of the expert judgement of clinicians and practitioners in evaluating the relative costs and tradeoffs between outcomes. These scores are also largely context-driven and should ideally differ between surgical complications, diseases, or even hospitals and departments. Finally, loss functions are general to supervised learning algorithms, i.e. U(m) can be experimented with and built into various ML models.
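
One possible way to build such utility scores into a model that outputs probabilities, offered here as an assumption rather than as the method this post prescribes, is to predict whichever class has the higher expected utility. If p is the predicted probability of the positive class, predicting positive is worth p*U_TP + (1-p)*U_FP and predicting negative is worth p*U_FN + (1-p)*U_TN, so the model should predict positive whenever p >= (U_TN - U_FP) / ((U_TP - U_FN) + (U_TN - U_FP)). The function and parameter names below are hypothetical; the default scores are those of the U(m) example.

```python
def utility_threshold(u_tp=1, u_tn=1, u_fp=-5, u_fn=-50):
    """Probability above which predicting positive has higher expected utility."""
    gain_if_positive = u_tp - u_fn  # utility gained by flagging a true positive
    gain_if_negative = u_tn - u_fp  # utility gained by clearing a true negative
    return gain_if_negative / (gain_if_positive + gain_if_negative)


def predict(prob_positive, threshold):
    """Predict 1 (positive) when the probability reaches the threshold."""
    return 1 if prob_positive >= threshold else 0


threshold = utility_threshold()
print(round(threshold, 3))                              # 0.105
print(predict(0.20, threshold), predict(0.05, threshold))  # 1 0

# With symmetric utilities the familiar 0.5 cut-off is recovered.
print(utility_threshold(u_tp=1, u_tn=1, u_fp=-1, u_fn=-1))  # 0.5
```

With false negatives penalized ten times as heavily as false positives, a case is flagged once its predicted probability reaches roughly 10.5% rather than 50%. Different scores for a different complication, disease or department would simply shift this threshold.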

Of course, the challenge in using utility functions in machine learning lies in the design of the function: how best to capture the inherent asymmetry and tradeoffs, and how to evaluate and summarize relative costs. This has to be done together with domain experts, as they are the ones who can provide the expert opinion and judgement needed to articulate the tradeoffs in utility. While network intrusion and employee attrition can be quantified in dollar value, the loss of a human life is definitely not so straightforward.