Author: Janani Ravi
“It is hardly an AI winter, but a chill is definitely in the air.”
The Wall Street Journal was at its perceptive best when it expressed this view. A similar thought permeated its take on the early hopes pinned on artificial intelligence in healthcare: “A decade later, reality has fallen short of that promise.”
These are valid observations, shared by many who feel disappointed and frustrated by machine learning models and AI systems that promised so much—but that now, once built, seem to underwhelm with their performance. Those concerns are real, yet many of them can be addressed by keeping in mind some realities of working with AI/ML models and systems.
Here are three I think are especially important for data scientists to regularly consider.
We must address concept drift
To begin with, it is worth understanding that ML models are at their best just before being deployed to production. All too often, it’s all downhill from there. They have just been trained with the latest available data, and hopefully accurately reflect the state of the world at that particular instant. However, once the model has been trained and deployed, the world will keep changing, and more data will become available every instant, but our model will still sit there, not incorporating these changes—unless it is re-trained and kept fresh.
This idea that a model will slowly get out of touch with new realities is called concept drift, and it explains a powerful driver of model underperformance. Thankfully, this particular one is relatively easy to fix: the model needs to be nurtured, re-trained and even entirely rebuilt as needed. This might belie some hopes that AI/ML models would be “write once, use forever.” But realistically, going by experience with other types of software systems, that was never going to be the case—not any time soon, anyway.
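Re-training on a schedule is one answer, but it also helps to detect drift directly, by comparing the feature values a model sees in production against the values it was trained on. Here is a minimal sketch using the Population Stability Index (PSI), a common drift metric; the `psi` helper, the synthetic data and the 0.25 retraining threshold are illustrative choices, not the only way to do this:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between training-time and live feature values."""
    # Decile edges computed from the training baseline
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Fold out-of-range live values into the end bins
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, edges)[0] / len(expected)
    actual_pct = np.histogram(actual, edges)[0] / len(actual)
    # Avoid log(0) when a bin is empty
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
training_values = rng.normal(0, 1, 10_000)  # the feature as the model learned it
live_values = rng.normal(1.0, 1, 10_000)    # the world has since shifted

# A common rule of thumb: PSI above 0.25 means significant drift
if psi(training_values, live_values) > 0.25:
    print("concept drift detected: schedule re-training")
```

A check like this, run per feature on each day's live traffic, turns "keep the model fresh" from a vague aspiration into a concrete trigger.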
We must be diligent about avoiding the temptation to overfit
Another, more insidious driver of model underperformance is overfitting. This problem often comes down to the institutional and organizational imperatives around getting new models accepted in an enterprise setting.
Say you are a perfectly competent data scientist who knows most of the pitfalls of overfitting, and even how to avoid them. However, you face a challenge: How do you get your model into widespread use on day one, before that model has built up any track record making predictions in the real world? The temptation (whether indulged consciously or unconsciously) is to have a really awesome set of performance statistics on historical data. Nothing eases the model’s acceptance quite like an amazing performance in backtesting. Unfortunately, that great performance on past data is all too often obtained by overfitting, or relying too much on specific, non-generalizable patterns that will not help with forward-looking predictions.
The old adage “sweat in training or bleed in battle” is worth keeping in mind here. It is often better, from a long-term perspective, to sweat over poor results in training your ML models, than to see them flop and flounder spectacularly once deployed for live prediction.
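To make the trap concrete, here is a small, self-contained illustration on synthetic data (the degree-9 polynomial stands in for any over-parameterized model): the overfit model posts a better score in "backtesting" on its own training data, while the honest model holds up better on fresh, forward-looking data.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n):
    # The true relationship is linear; everything beyond that is noise
    x = rng.uniform(0, 1, n)
    return x, 2 * x + rng.normal(0, 0.3, n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)  # "forward-looking" data the model has never seen

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

honest = np.polyfit(x_train, y_train, deg=1)   # matches the true structure
overfit = np.polyfit(x_train, y_train, deg=9)  # memorizes the noise

print(f"backtest MSE: honest={mse(honest, x_train, y_train):.3f}, "
      f"overfit={mse(overfit, x_train, y_train):.3f}")  # overfit looks amazing here
print(f"live MSE:     honest={mse(honest, x_test, y_test):.3f}, "
      f"overfit={mse(overfit, x_test, y_test):.3f}")    # ...and flops here
```

The higher-degree fit will always match the training points at least as well as the simple one, which is exactly why an amazing backtest, on its own, proves nothing about live performance.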
We must be rigorous about keeping our code paths consistent
A third reason for models that overpromise during training or validation but underdeliver down the road is training-serving skew.
When a model is being trained or validated, the data flowing in is typically in a stable, predictable, easy-to-consume format (for instance, data for training or hyperparameter tuning usually lives in a database, i.e., a batch store). Missing data has been analyzed and accounted for meticulously, and no dodgy assumptions have been made while doing so. Everything is clean, neat and tidy. Then the model goes live, and suddenly, things are different.
The problem instances—the data from which the model has to make its predictions—are no longer in a batch store; they come streaming in. They may have important attributes missing, and in ways that simply could not have been accounted for during training. Terrible simplifying assumptions might get made, such as taking a simple average rather than a volume-weighted average of stock prices over an interval. The next thing you know, the model's predictions seem terrible, and you, as the model developer, are left facing angry users.
The solution here is simple, but not easy: the code paths followed by data should be the same, whether the data is used for training, validation or actual live prediction. This also underscores the need to nurture models after they go live: we can and should keep track of the unexpected use-cases that lead to these "garbage in, garbage out" situations, and we should go back and fix the models so that they do better the next time.
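In practice, "same code path" often means one shared feature function that both the batch pipeline and the streaming service call, so a shortcut can never creep into just one of them. The function names and record shapes below are hypothetical, but they sketch the pattern using the stock-price example above:

```python
def vwap(prices, volumes):
    """Volume-weighted average price: the single source of truth for this feature."""
    total_volume = sum(volumes)
    if total_volume == 0:
        # Fail loudly rather than silently substituting a simple average
        raise ValueError("no volume in interval; cannot compute VWAP")
    return sum(p * v for p, v in zip(prices, volumes)) / total_volume

def features_from_batch_row(row):
    # Training/validation path: rows pulled from the batch store
    return {"vwap": vwap(row["prices"], row["volumes"])}

def features_from_stream_event(event):
    # Live path: the very same vwap(), so no simple-average shortcut can sneak in
    return {"vwap": vwap(event["prices"], event["volumes"])}

record = {"prices": [10.0, 20.0], "volumes": [1, 3]}
assert features_from_batch_row(record) == features_from_stream_event(record)
```

Because both paths funnel through one function, a fix made after a live "garbage in, garbage out" incident automatically applies to the next training run as well.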
Reversing the chill
Getting the best out of AI/ML models involves doing the work. The movie does not end when the model goes live with an awesome validation score. On the contrary, that’s where a lot of the boring, but important, work actually begins.