Been wanting to do this consolidation for some time, so here goes. I won’t be touching on any language-specific questions (e.g. packages or functions), as I don’t believe they are relevant in this Google/Stack Overflow era - kind of missing the forest for the trees. I also won’t be going over basic questions like “What is logistic regression” or “How does collaborative filtering work”.
Rather, I want to focus on ML concepts that are closely tied to the consumption of ML models by the business. As ML matures in the market, the next sought-after set of skills should be closely related to the translation of business requirements, model deep-diving, ML pipeline management (roadmapping), and the curation of models.
Questions will be categorized according to the following:
- Business understanding - developing business understanding in a short amount of time is a key skillset for a data scientist, necessary for building superb ML models. In addition, managing your customers at the interface between your ML practice and their business expertise is key to project success
- Statistics - contrary to what plenty of data scientists out there believe, I continue to hold that statistics is a prerequisite core skillset for being a good data scientist (hence even the name of my blog)
- Model Building
- Model Selection - it’s easy to select models based on metrics like accuracy or ROC-AUC. What happens if there are additional concerns or complications from the business?
- Model Maintenance - a data scientist’s job doesn’t stop at building the models, it should also include keeping our models healthy for business consumption
1. As data scientists, a big part of our job is to understand the business in which our usecases originate, and to glean the subtle ways in which our analysis and modelling could be influenced or impacted. This typically starts from the multiple requirements gathering meetings we have with our customers. Given a predictive usecase, what are some of the questions you would ask your customers in order to arrive at the relevant information?
Identify drivers, (strength of) business assumptions, get clarity of data availability of drivers, span of historical data available, exceptions, macro-environment changes and their impact, availability of external data
2. In the initial stages of modelling/exploratory data analysis, how would you validate your newfound understanding of the business with your customers?
Construct preliminary models, and use customers’ input directly as a means of feature selection
3. Suppose your analysis and/or models are giving plenty of output that deviates substantially from your customers’ expectations - for example, unintuitive feature importance from a high quality model, or unexpected predicted values. How would you develop a compromise between good ML practice and their business knowledge and negotiate your way through the project?
4. What is customer success and why is it important in data science projects?
In many data science units, including both inward- and outward-facing units, customer success (CS) is often neglected as a form of “post-sales” activity. CS refers to the continued and sustained effort to ensure that the deliverables, regardless of models or tools etc., become an integral part of the customer’s processes or workflows.
CS is about maximizing the adoption and usage of the deliverables, to ensure that the customer is successful in operating the tools that have been delivered to them.
While it seems reasonable to scope a project in terms of producing the deliverables and ending the engagement thereafter, this is an unhealthy way of running any project, including data science projects. CS is important because it is an indication of the return on investment for a project - for an outward-facing data science unit, CS and post-sales activities maximize customer experience and the quality of the account, leading to sales/presales pipeline development. This is relevant to both software and professional services vendors. For an inward-facing unit, user adoption and value delivered can be directly measured to assess the true value throughput of the data science unit in serving the company.
As it turns out, CS is a function of multiple factors, including customer experience, user experience, and change management. As data scientists, there are a couple of things which we can do to directly contribute to CS:
- During project scoping and requirements gathering, ensure that the project is well-scoped in time and space, in terms of e.g. unit of analysis, target variable, number of models, modelling criteria and success criteria. Strive to deliver clear understanding on data science terminologies to business stakeholders and users, minimize ambiguity and align expectations. For example, ML projects are typically pitched to create value in accuracy and/or automation. Ensure this expectation on accuracy and/or automation is aligned.
- Reduce the use of ad-hoc data (e.g. standalone, manually curated/maintained spreadsheets) where possible. If these datapoints turn out to be valuable to the models, create scripts or workflows to ensure data refreshes can be done as hassle-free as possible.
- Build confidence with customers by performing e.g. 1 to 2 rounds of exploratory data analysis (EDA). Illustrating and validating business assumptions, as well as retrieving any data artifacts or surprising insights, such as trends or correlations, improves customer experience and the subsequent modelling process. Present these pre-modelling results to customers for validation and communication.
- Ensure the modelling process is clearly illustrated in an “explain-like-I’m-5” manner. Let customers understand why certain features are dropped, or why a certain feature is engineered in a particular manner. Customers should be able to understand feature importance in a model. Most importantly, customers should not feel that models are black-box as this increases the fear of the unknown and reduces model adoption.
- Document all work products throughout the project, from business assumptions, data preparation and modelling, to model deployment. This ensures reproducibility.
- Finally, develop a reasonable model monitoring and maintenance process. Neither data refreshes nor model refreshes should be too frequent, manual or time-consuming. A reasonable maintenance cadence maximizes model adoption and customer success.
1. How do you detect statistical interactions in a dataset, and how would that affect your modelling?
Effect modification. The classical way is to include multiplicative terms in the model, though this does not always scale - O(n^2)-type complexity in the number of features.
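As a minimal sketch of the multiplicative-term approach (the synthetic data and variable names below are illustrative assumptions, not from the original post), we can fit a linear model that includes the product of two features and inspect its coefficient:

```python
import numpy as np

# synthetic data with a true x1*x2 interaction (illustrative only)
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 3.0 * x1 * x2 + rng.normal(scale=0.5, size=n)

# OLS fit with a multiplicative term, via least squares
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# a large coefficient on the x1*x2 column signals an interaction (~3 here)
```

In a real project one would formally test the term’s significance rather than eyeball the coefficient; with p features there are O(p^2) pairwise terms, hence the scalability concern above.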
2. How do you detect confounding in a dataset, and how would that affect your modelling? How does confounding differ from multicollinearity?
Tackle from epidemiology standpoint
3. Would it be apt to use p-values, either from univariate- or multivariate-type analysis, as indications of feature importance or as a means of feature selection? If yes, how would you use it? If no, what are your considerations?
No. The definitions of p-values and the null hypothesis differ subtly from feature importance, but with substantial impact on interpretation.
4. What is bias, and how would bias affect your analysis, modelling and interpretations of data?
Selection bias, information bias
5. What is the bias-variance tradeoff, and how would the tradeoff affect your modelling?
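One way to see the tradeoff numerically (the synthetic data and polynomial degrees below are my own illustrative choices) is to compare train and test error across models of increasing flexibility:

```python
import numpy as np

# synthetic nonlinear data (illustrative): y = sin(3x) + noise
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=200)
x_tr, y_tr = x[:100], y[:100]
x_te, y_te = x[100:], y[100:]

def poly_mse(degree):
    """Train/test mean squared error of a polynomial fit of given degree."""
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coefs, xs) - ys) ** 2)
    return mse(x_tr, y_tr), mse(x_te, y_te)

# degree 1 underfits (high bias); increasing the degree drives training
# error down, but an overly flexible fit eventually inflates test error
# (high variance)
```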
1. Why is there a general tradeoff between model performance and interpretability? Suppose you constructed a high performing model that is essentially a black box, e.g. a deep learning model. How would you present the model to your customers in a more interpretable manner?
Complexity of business environment, opaque drivers with unknown interactions. Build simple regression or decision tree models with important features from black box as proxy for interpretation.
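A minimal sketch of the proxy-model idea, assuming scikit-learn is available (the gradient boosting classifier and synthetic data stand in for an arbitrary black box): train a shallow, interpretable tree on the black box’s own predictions and measure how faithfully it mimics them:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# a stand-in "black box" trained on synthetic data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

# train a shallow tree on the black box's PREDICTIONS, not the raw labels,
# so the tree approximates the black box's decision logic
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# fidelity: how often the surrogate agrees with the black box
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
```

The shallow tree (or a simple regression) is what gets presented to the customer; fidelity quantifies how much trust to place in that simplified story.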
2. Given a reasonably clean dataset in a predictive usecase, what are the tasks standing between the dataset and a good model? Which are the tasks that would, in principle, take the longest time to perform?
Feature transformation, feature engineering, hyperparameter tuning, model selection. Feature engineering should be the most challenging and take the longest. A good feature set can easily beat a well-tuned model, because the former is closer to the true underlying business context than the latter.
3. Having a dataset with temporal element means that it lends itself to both typical supervised learning models as well as time series models. Again, in a predictive usecase, how would you decide which is a better way to reach quality predictions?
4. Would your general modelling methodology differ substantially or at all if your customers are seeking explanations instead of predictions?
Broadly speaking, suppose we consider the divide between tree-based methods and linear regression methods. In the construction of a decision tree, the training dataset is divided into multiple discrete p-dimensional subspaces, where p is the number of features. To move from one subspace to its adjacent neighbor, one would have to make multiple unit increments in one direction towards the subspace boundary. This means that throughout those unit increments, the predicted value of a test case remains the same, until the boundary is crossed. Contrast this with a linear regression-type method, where we can see the impact on the predicted value for every unit increment of a given feature, via its coefficient. Intuitively, this better aids explanation and understanding, and can be used to perform sensitivity or what-if analysis for deeper insights.
In addition, we could consider fit-emphasizing metrics such as R2 instead of prediction-emphasizing metrics such as accuracy for model evaluation.
5. In a multiclass classification problem, it is typically challenging to develop a high performing model. How would you tackle a multiclass classification problem?
Collapse class, multi-tier modelling, though errors in classification will cascade. If class is ordinal, measure error wrt distance to the correct class rather than in absolute terms.
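A small illustration of measuring ordinal classification error by distance to the correct class (the labels below are made up for illustration):

```python
import numpy as np

# ordinal classes, e.g. risk grades 0 (lowest) to 4 (highest)
y_true = np.array([0, 1, 2, 3, 4, 2])
y_pred = np.array([0, 2, 2, 1, 4, 3])

zero_one_error = (y_true != y_pred).mean()       # every miss counts the same
ordinal_error = np.abs(y_true - y_pred).mean()   # misses weighted by distance
# under the ordinal metric, predicting 1 when the truth is 3 is twice as
# bad as predicting 2
```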
1. Given two black box models, how would you choose one over the other?
Generally, we can consider the following: i) model performance on the testing set, ii) model performance on the fitted training set, iii) feature importance, and iv) model simplicity.
i) Model performance on the testing set is obviously important, as it directly points to the model’s ability to generalize to unseen cases.
ii) Model performance on the fitted training set can give an indication of the fundamental quality of the model itself. For example, if training performance outweighs testing performance by a large margin, then overfitting could be a concern. On its own, a low training performance indicates underfitting, while an extremely high training performance could indicate target leakage.
iii) Feature importance illustrates the relative weightages of underlying features in the model and how they contribute to reaching a prediction or outcome. In a scenario where a strong model outperforms a weaker model but with a somewhat bizarre and unintuitive set of feature importance, it becomes a business decision to make a prudent model selection. This is especially important in industries e.g. banking, where decisions to reject a loan may need to be entirely transparent.
iv) Finally, Occam’s razor would be a good heuristic to apply - for a given model performance, the simpler a model is, the better it is for human and business consumption.
2. In usecases that typically exhibit class imbalance, yet not to the extent where anomaly detection algorithms are appropriate, how would we ensure that the models selected are adequate for business consumption?
In classification problems with class imbalance, one of the first things to experiment with is upsampling the minority class or downsampling the majority class. Typically, the former is preferred, as we then don’t lose training samples. We can then follow up with the modelling and, more importantly, the selection of a metric robust to class imbalance, such as ROC-AUC, sensitivity or specificity. This part is straightforward.
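As a sketch of upsampling the minority class (plain NumPy on synthetic data; in practice one would resample only the training split):

```python
import numpy as np

# synthetic imbalanced dataset: roughly 10% positives (illustrative)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.1).astype(int)

pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
# upsample the minority class with replacement to match the majority count
pos_up = rng.choice(pos, size=len(neg), replace=True)
idx = np.concatenate([neg, pos_up])
X_bal, y_bal = X[idx], y[idx]  # balanced: half positives, half negatives
```

Note that resampling should never touch the holdout set, or the evaluation metrics will no longer reflect the real class distribution.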
More importantly, and very often neglected, is a deliberate consideration of the risk or cost of a wrong prediction, especially in a class imbalance setting. For instance, suppose we are tackling an employee attrition usecase, where we are predicting whether an employee will leave the company within the next 3 months. This is of course a typical class imbalance problem - most reasonably large companies have an attrition rate of about 10 to 20% as a healthy number.
However, suppose the usecase is specific to high potential individuals within the employee population - employees who are earmarked by HR and management as e.g. successors of critical roles in the company (hence high potential). In this context, a false negative, i.e. wrongly predicting no attrition, becomes a costly mistake. In contrast, a false positive is a false alarm and is much less costly.
How do we capture this asymmetry in cost/risk in our model evaluation process? A solution would be to develop a utility function U(m) and assign relative utilities to our prediction outcomes, as below:
| Prediction outcome | Utility |
| --- | --- |
| True positive | 1 |
| True negative | 1 |
| False positive | -5 |
| False negative | -50 |
Then, the utility of a model m would be

U(m) = 1*n1 + 1*n2 - 5*n3 - 50*n4

where n1, n2, n3 and n4 are the respective counts of true positives, true negatives, false positives and false negatives. Model tuning and validation can then be done by maximizing utility.
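The utility calculation above can be sketched as follows (a minimal implementation, assuming binary 0/1 labels):

```python
import numpy as np

# utilities mirroring the table above
UTILITY = {"TP": 1, "TN": 1, "FP": -5, "FN": -50}

def model_utility(y_true, y_pred):
    """Total utility of a binary model's predictions (labels in {0, 1})."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    n_tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    n_fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    n_fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return (UTILITY["TP"] * n_tp + UTILITY["TN"] * n_tn
            + UTILITY["FP"] * n_fp + UTILITY["FN"] * n_fn)

# a single missed leaver dominates the score
model_utility([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])  # → -52
```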
3. In most projects, multiple models are selected to be production-grade and their respective outputs are combined. What are the different ways we can combine multiple models?
There are various ways in which models can be combined, such as boosting, ensemble learning and stacking. I would like to focus here on a less well-known approach called the analytic hierarchy process (AHP). In essence, AHP is a utility-driven way of combining different types of models to reach a single conclusion. The best way to illustrate AHP is with an example.
Consider a HR analytics problem statement: how can the impact of employee attrition to the company be minimized?
Typically, in HR analytics, we can say that the primary outcomes that are impactful and can be evaluated are performance, potential, and attrition. We can therefore formulate three usecases that culminate in our problem statement:
- Predict the short-term performance of a given employee
- Predict the potential/runway of a given employee
- Predict the attrition risk of a given employee
This necessarily means that we would have at least three different models, with three different target variables, to construct. How would we then combine these models to address the problem statement of minimizing the impact of employee attrition? This is done by using AHP to build a utility function, assigning utilities or weights to the predicted probabilities of each model. Without going through the details of AHP, the following are examples of simple utility functions:
impact of attrition = attrition risk * short-term performance + 5 * attrition risk * potential
impact of attrition = attrition risk * employee rank * (short-term performance + potential)
impact of attrition = attrition risk * (40 - tenure) * potential
Note that we can include additional layers of modelling in the AHP to capture specific intricacies and requirements - for example, the utility function need not be a linear combination of weights, but a decision tree or a matrix.
In general, AHP is an extremely powerful method to combine multiple models to address a large and encompassing problem statement.
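The first example utility function above could be coded as follows (the input scores and their scales are hypothetical; in practice they would be the predicted probabilities or scores of the three models):

```python
def attrition_impact(attrition_risk, performance, potential):
    """First example utility function above: potential is weighted 5x.

    Inputs are assumed to be predicted scores from the three models
    (scales are hypothetical).
    """
    return attrition_risk * performance + 5 * attrition_risk * potential

# a moderate-risk, high-potential employee scores high on impact
attrition_impact(0.5, 0.8, 0.6)  # ≈ 1.9
```

Swapping this linear combination for a decision tree or a lookup matrix, as noted above, only changes the body of this function; the surrounding ranking workflow stays the same.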
1. What are the points of consideration in deciding when and how to update or refresh a model?
In general, the performance of a model erodes over time, with changes in business environment. Model refresh is therefore an important consideration even before commencement of any data science project.
There are a number of factors when considering when and how to refresh a model:
- Availability of data - when can fresh datapoints be captured and sufficiently verified? For example, if fresh datapoints are captured on a weekly or monthly basis, then models can be refreshed at most on a weekly or monthly basis. On the other hand, monthly captures with only quarterly validation mean that models can be refreshed at most on a quarterly basis.
- Changes in business environment - has there been a major change in the environment? For example, new policies announced or implemented by authorities, entry of new competitors, new product launches, major news or events could all justify model refreshes. Of course, this is also dependent on data availability.
- Refresh/no-refresh cost-benefit analysis - a model refresh could require the collection of updated datapoints and manual effort in executing the refresh, which carries some dollar-value cost. Weigh this against the dollar-value benefit of the refresh; if the refreshed model is of low consequence and unlikely to steer decision making or operations in a significant manner, the refresh may not be worth it.
2. What are the ways you can refresh a model (from ML perspective, not engineering perspective)?
The simplest way to refresh a model is to simply rerun the training algorithm on the updated data with a fresh round of hyperparameter tuning. The next order of complexity would be to re-execute the training process with the same feature set (on updated data), but re-experimenting with the various training algorithms. Layering on top of that could be additional feature engineering for potentially better model performance.
Finally, the most labour-intensive form of refresh would be to rework the entire model from scratch, starting again with requirements gathering and assumptions validation, with possibly new feature sets.