Machine learning (ML) models tend to drift and decay over time. They may also return bizarre, unpredictable results for edge cases that weren’t anticipated.
So, if you opt for the “deploy and forget” approach, that deterioration will probably go unnoticed. There won’t be an error message or unexpected downtime; the model will just keep silently failing.
That’s why, once your ML model is up and running, your work has just begun. Now, you need to keep an eye on your model’s performance and put in the work to maintain and improve it.
At Integrio, we’re well-acquainted with both the strengths and limitations of AI models. We established an AI and ML department in 2017 and have since built, deployed, and maintained a variety of ML models. Our AI-assisted coding experts routinely rely on these models to streamline our cloud migration services and other tasks.
Here’s everything you need to know about monitoring, maintaining, and improving ML models in production, based on our experience.
Monitoring Machine Learning Models in Production
Before you can jump into tweaking the ML model, you need to know what exactly you’ll be tweaking it for. That’s where model monitoring comes in. It involves tracking and analyzing four types of metrics to alert you to potential issues with model quality, reliability, and more.
Software System Health
Yes, monitoring the ML model closely is crucial, but you can’t ignore that it’s part of a larger software system. So, besides ensuring accurate model output, you also need to keep an eye on the whole system’s stability and performance. The nitty-gritty here is the same as for any other application in production.
Metrics to monitor:
Latency
Error rates
Memory usage
Disk usage
Average response time
Application and server CPU usage
Application availability
Throughput
Data Quality and Integrity
“Garbage in, garbage out.” This adage describes a real danger to ML model accuracy: incorrect or incomplete input data will lead to unreliable output.
A lot can go wrong with the input data. So, ensuring its quality involves checking for missing values, data format mismatches, or outliers among the incoming data points. Set rules and acceptable ranges for values whenever possible, too (e.g., “customer age can’t be negative”).
Metrics to monitor:
Share of missing values
Deviations from data schema and feature ranges
Feature statistics (mean values, min-max ranges)
Outlier frequency
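The batch-level checks above can be sketched in plain Python. This is a minimal illustration, assuming a simple per-feature range rule; the `data_quality_report` helper and its rules are hypothetical, and production systems typically use a validation library or a monitoring tool instead:

```python
def data_quality_report(records, ranges):
    """Share of missing and out-of-range values per feature.

    `ranges` maps feature name -> (min, max). Hypothetical helper
    for illustration; real pipelines would use a validation library.
    """
    total = len(records)
    report = {}
    for feature, (lo, hi) in ranges.items():
        missing = sum(1 for r in records if r.get(feature) is None)
        bad = sum(
            1 for r in records
            if r.get(feature) is not None and not (lo <= r[feature] <= hi)
        )
        report[feature] = {
            "missing_share": missing / total,
            "out_of_range_share": bad / total,
        }
    return report

# Example: "customer age can't be negative" expressed as a range rule.
ranges = {"age": (0, 120)}
batch = [{"age": 34}, {"age": -3}, {"age": None}]
report = data_quality_report(batch, ranges)
assert abs(report["age"]["missing_share"] - 1 / 3) < 1e-9
assert abs(report["age"]["out_of_range_share"] - 1 / 3) < 1e-9
```

Running such a report on every incoming batch gives you the missing-value shares and range deviations to alert on.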
ML Model Quality and Relevance
Tracking model quality and relevance gives your team an idea of how accurate the model is and how well it performs. Your choice of metrics, however, hinges on the type of model your Python developers built.
Important: Monitor model quality metrics across the whole ML model and its specific cohorts (e.g., customer types, devices, etc.). Segment-based monitoring will reveal issues that might go unnoticed in the bird’s-eye view of model quality metrics.
Metrics to monitor:
Classification: model accuracy, precision, recall, F1-score
Regression: mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), etc.
Ranking and recommendations: normalized discounted cumulative gain (NDCG), precision at K, mean average precision (MAP), etc.
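For intuition, the core classification metrics can be computed by hand for a binary model. This sketch is illustrative only; in practice you would use a metrics library such as scikit-learn:

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 for a binary classifier, from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Compute these per cohort, not just globally: one segment can
# degrade while the overall numbers still look fine.
metrics = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
assert metrics == {"precision": 0.75, "recall": 0.75, "f1": 0.75}
```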
Drift Detection
Over time, ML model quality tends to deteriorate, plain and simple. That’s known as model drift, and it comes in two main flavors.
Concept drift is a shift in the relationship between inputs and outputs that makes the patterns identified by ML models no longer accurate. Think of the initial training datasets as a snapshot of the world: it was picture-perfect then, but things have changed. So, the snapshot is now outdated.
Data drift is a change in the statistical characteristics of the input. It happens when the real-world data deviates from the data in its training datasets or earlier inputs. Simply put, data drift means your model is being used with inputs it wasn’t meant to work with.
Metrics to monitor:
Concept drift: Changes in correlation coefficients (Pearson's, Spearman's), input data drift metrics
Data drift: Key statistics for variables (mean, median, variance, quantiles, etc.), feature min-max range compliance, statistical test results (p-value), distribution distance metrics (Wasserstein Distance, Jensen-Shannon Divergence)
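One drift score that is easy to compute by hand is the Population Stability Index (PSI), which compares the binned distribution of a feature at serving time against the training-time reference. The sketch below uses equal-width bins and the common rule of thumb that PSI above roughly 0.2 signals notable drift; both choices are simplifications:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    Uses equal-width bins over the reference sample's range.
    Rule of thumb: PSI > 0.2 suggests significant drift.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def histogram(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[max(idx, 0)] += 1  # clip values outside the reference range
        # Small epsilon avoids log(0) for empty bins.
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [float(x % 100) for x in range(1000)]  # training-time feature
identical = list(reference)                        # no drift
shifted = [x + 50 for x in reference]              # distribution has moved
assert psi(reference, identical) < 0.01
assert psi(reference, shifted) > 0.2
```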
Tools for Monitoring ML Models
You can take your pick among multiple tools designed to streamline monitoring. For example:
Evidently AI. Evidently is an open-source library for Python developers that enables monitoring both in development and production. It can conduct batch model checks, monitor metrics in real time, and visualize results in interactive dashboards.
Fiddler AI. Fiddler is an easy-to-use tool that combines model monitoring, observability, and guardrail management. Monitoring features include outlier detection, service metrics tracking, and drift visualization.
Amazon SageMaker Model Monitor. Meant for ML models built with Amazon SageMaker, it both runs batch checks and monitors real-time stats. You can also set up alerts and use prebuilt monitoring templates.
Maintaining ML Models
Setting up metric tracking and alerts isn’t enough on its own. You’ll also need to be proactive and take action to ensure your ML model remains stable and reliable.
Implement a Data Schema
A data schema is the blueprint for organizing the incoming raw data. It describes the types of data, relationships between variables, expected formats, and possible or required attributes.
Creating a data schema allows you to set rules for input validation. That, in turn, helps catch anomalies, unexpected values, or unusual data distributions early on.
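Here is what a lightweight schema could look like in code. The field names, types, and rules below are hypothetical placeholders; dedicated schema libraries offer the same idea with far more features:

```python
# Hypothetical schema: field -> (expected type, required?, validation rule)
SCHEMA = {
    "customer_id": (str, True, None),
    "age": (int, True, lambda v: 0 <= v <= 120),
    "country": (str, False, None),
}

def validate(record, schema=SCHEMA):
    """Return a list of validation errors for one input record."""
    errors = []
    for field, (ftype, required, rule) in schema.items():
        if field not in record or record[field] is None:
            if required:
                errors.append(f"{field}: missing required field")
            continue
        value = record[field]
        if not isinstance(value, ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
        elif rule is not None and not rule(value):
            errors.append(f"{field}: value {value!r} violates rule")
    return errors

assert validate({"customer_id": "c-1", "age": 34}) == []
assert validate({"customer_id": "c-1", "age": -5}) == ["age: value -5 violates rule"]
```

Rejecting or quarantining records that fail validation keeps bad inputs from silently skewing model output.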
Set Up Automated Alerts
Unlike other software, ML models can fail silently. Instead of flashing an error message, they’ll still do what they’re meant to do — return a prediction, generate a response — except that their output will be biased or inaccurate.
That’s why you need to set up alerts that notify your team whenever metrics cross certain thresholds, so model degradation gets caught early. Start with generous thresholds and tighten them as you learn more about your model’s behavior.
Pro tip: Use a layered strategy based on alert criticality. Some changes in metrics can be logged, and that’s it; others warrant issuing batch warnings or real-time critical alerts.
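A layered strategy can be as simple as mapping metric values to alert tiers. The thresholds and tier names below are hypothetical; tune them to your own model’s baseline behavior:

```python
# Hypothetical tiers for one metric (model accuracy), from mildest to most severe.
THRESHOLDS = [
    (0.90, "log"),       # mild dip: record it, no one gets paged
    (0.85, "warning"),   # batch warning to the team channel
    (0.75, "critical"),  # real-time alert, consider rolling back
]

def alert_level(accuracy, thresholds=THRESHOLDS):
    """Map a metric value to an alert tier; None means all clear."""
    level = None
    for bound, tier in thresholds:
        if accuracy < bound:
            level = tier  # keep escalating as lower bounds are crossed
    return level

assert alert_level(0.93) is None
assert alert_level(0.88) == "log"
assert alert_level(0.70) == "critical"
```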
Use Version Control
Versioning is crucial for keeping everyone on the same page regarding model changes, the reasoning behind them, and their impact. Besides, it allows the team to roll back to a previous version if needed or trace the output to its input during a compliance audit.
When it comes to ML applications, you need to version the source code itself, as well as artifacts like:
Training datasets
Feature definitions
Environment dependencies
Evaluation metrics
Validation results
Git is the gold standard for code versioning, while tools like DVC were created specifically for versioning datasets.
Ensure Explainability
Explainability is key to building a trusted AI solution; little wonder that 46% of companies cite it as a concern. Explainability means you can trace the model’s reasoning and understand why it returns particular results, which helps you debug the model and detect biased output.
There are two common model-agnostic ways to ensure model explainability:
SHapley Additive exPlanations (SHAP). Available as a library, it calculates each feature’s contribution to predictions made across the whole model. The library also offers multiple visualization tools (e.g., summary plot, dependence plot).
Local Interpretable Model-agnostic Explanations (LIME). Instead of focusing on the whole model, it explains specific results on a case-by-case basis. The LIME module calculates feature contributions and presents them as a table.
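To build intuition for what SHAP reports, here is an exact Shapley-value computation for a toy model, done by brute-force enumeration of feature coalitions. The SHAP library approximates this efficiently for real models; this sketch is only workable at toy sizes:

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    `predict` takes a full feature vector; features absent from a
    coalition are filled in from `baseline`. Exponential in the
    number of features -- for illustration only.
    """
    n = len(instance)
    features = range(n)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in features]
        return predict(x)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (value(set(subset) | {i}) - value(set(subset)))
    return phi

# Toy linear model: each feature's contribution is w_i * (x_i - baseline_i).
predict = lambda x: 2 * x[0] + 3 * x[1] + 1.0
phi = shapley_values(predict, instance=[4.0, 1.0], baseline=[0.0, 0.0])
assert abs(phi[0] - 8.0) < 1e-9  # 2 * (4 - 0)
assert abs(phi[1] - 3.0) < 1e-9  # 3 * (1 - 0)
```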
Incorporate Human-in-the-Loop (HITL) Systems
The human-in-the-loop approach means involving human input during model training, validation, and optimization. In maintenance, that means involving domain experts in reviewing and adjusting models to:
Provide feedback to ML models
Evaluate the model’s performance
Label new data for further retraining
An HITL system can also refer to integrating a human feedback loop into the application’s workflows themselves. For example, an ML model can identify the most likely diagnosis for a patient, but a doctor will still have to review it before making a judgment call.
Conduct Regular Audits
Comprehensive audits help catch errors or drift early, verify reproducibility, ensure consistency across model versions, and revalidate assumptions, among other things. During an audit, check:
Bias and fairness in datasets and model output
Regulatory compliance (e.g., GDPR, HIPAA)
Model transparency and explainability
Model security and stability
Data quality and integrity
Model governance and controls
Improving ML Models
As you continue monitoring and maintaining your model, you’ll inevitably spot opportunities for improvement — or at least gather real-world data to inform your future development choices. Here’s what improving an ML model means in practice.
Regularly Retrain the Model
Data doesn’t stand still, and neither should your model. To keep its output relevant and accurate, you’ll need to retrain it using more recent data.
That retraining can follow a fixed schedule or respond to model degradation signals. It’s not an either-or choice; in fact, it’s best to combine both.
How often should you retrain your ML model outside metric-driven trigger events? It depends. Retraining can happen daily, weekly, or monthly, depending on the rate of data change.
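Combining both triggers can look like the sketch below. The 30-day schedule and 0.85 accuracy floor are hypothetical defaults, not recommendations; tune both to your data’s rate of change:

```python
from datetime import date, timedelta

def should_retrain(last_trained, today, current_accuracy,
                   max_age=timedelta(days=30), min_accuracy=0.85):
    """Combine schedule-based and metric-driven retraining triggers.

    Thresholds here are hypothetical placeholders.
    """
    stale = (today - last_trained) >= max_age   # fixed schedule
    degraded = current_accuracy < min_accuracy  # degradation signal
    return stale or degraded

assert should_retrain(date(2024, 1, 1), date(2024, 2, 15), 0.95)      # stale
assert should_retrain(date(2024, 1, 1), date(2024, 1, 10), 0.70)      # degraded
assert not should_retrain(date(2024, 1, 1), date(2024, 1, 10), 0.95)  # healthy
```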
Automate CI/CD Pipelines
Considering how often you may need to retrain your model, relying on manual processes is decidedly not an option. Luckily, automation can speed up the whole process.
That’s what MLOps is for: this methodology applies the principles of DevOps to ML systems. It involves implementing:
Continuous integration (CI): Automatically testing and validating code, components, data, data schemas, and models
Continuous delivery (CD): Automatically deploying the model prediction service
Continuous training (CT): Automatically retraining and serving the models
The benefits of an automation-first approach are obvious: faster validation and rollout, fewer manual errors, and accelerated time-to-market.
Gather Ground Truth Data
On its own, an ML model can’t tell the truth from fiction. That’s why it needs ground-truth data, the gold standard of accuracy. During development and validation, it consists of labeled datasets.
After you deploy your ML model, compare the model’s real-world output to ground truth data to evaluate its performance. Plus, keep updating ground-truth data to align with changing real-world patterns and inputs.
Implement Shadow Models
Shadow models are new or updated ML models that you’re considering putting into production. Deploying them in parallel with the actual production model allows you to see how they perform in a live environment.
Under this arrangement, the shadow model (the candidate) is given the same input as the current production model. Both generate a response, but the user sees only the production model’s output. So, you can test the model in field conditions without impacting user experience.
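The serving logic boils down to a few lines. In this sketch, `production_model` and `shadow_model` stand in for any callable predictor (hypothetical names); the key point is that shadow failures never reach the user:

```python
import logging

logger = logging.getLogger("shadow")

def serve(request, production_model, shadow_model):
    """Run both models on the same input; return only production output."""
    result = production_model(request)
    try:
        candidate = shadow_model(request)
        # Log both outputs for offline comparison against ground truth.
        logger.info("input=%r production=%r shadow=%r", request, result, candidate)
    except Exception:
        # Swallow shadow errors: the user experience depends only on production.
        logger.exception("shadow model failed")
    return result

production = lambda x: x * 2
shadow = lambda x: x * 2 + 1  # candidate under evaluation
assert serve(21, production, shadow) == 42
```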
Iterate on Feature Engineering
Your initial assumptions will lead you to pick one set of features, but it may turn out too wide or too narrow for your needs. So, use real-world performance data to refine that set by removing or adding features. You can also transform existing features to extract more signal from your data.
When done right, refining feature selection increases model reliability and accuracy. However, don’t forget to test and validate the new model version before rolling it out into production.
Final Thoughts
Post-deployment, ML-powered applications require a lot more upkeep than their non-ML counterparts. So, factor that in when you’re allocating resources and planning the project. Remember: if you don’t put in the work, the ML model will start failing at some point, and you’ll be none the wiser.
Not sure where to start with MLOps or which tools to select for model monitoring? Integrio’s machine learning experts can offer you sound advice — or handle model monitoring, maintenance, and improvement from start to finish. Get in touch with us to discuss how we can help you overcome your post-deployment challenges.
FAQ
Why do machine learning models degrade over time?
Machine learning models may degrade due to input data quality issues and concept or data drift. For example, historical patterns may no longer reflect current patterns.
How do you use an ML model in production?
To use a production-ready ML model, you need to deploy it into a live environment. That means integrating it with other systems and setting up continuous monitoring and maintenance.
How do you prepare an ML model for deployment?
Before deployment, set up data pipelines, test and validate the model, deploy infrastructure, and establish processes for monitoring and maintaining the system. We also strongly advise implementing MLOps.
How do you test an ML model?
When testing an ML model, check its robustness, interpretability, and reproducibility, as well as output bias, fairness, compliance, and transparency. To assess the model’s performance, you can run invariance, directional expectation, and minimum functionality tests.