AI Model Monitoring in Production: What to Track and Why Most Teams Get It Wrong
There’s a well-known pattern in AI projects: the team spends months building and training a model, deploys it with fanfare, and then… moves on to the next project. Nobody’s watching the model in production. Nobody notices when accuracy quietly degrades. Six months later, someone runs an analysis and discovers the model has been making increasingly bad predictions for weeks.
This isn’t a rare failure mode. It’s the default outcome for teams that don’t invest in monitoring. And yet model monitoring remains one of the least developed areas of the MLOps lifecycle — partly because it’s less exciting than model building, and partly because it requires thinking about failure modes that haven’t happened yet.
Here’s what production model monitoring should actually look like.
Why Models Degrade
Models are trained on historical data. They learn patterns in that data and apply those patterns to new inputs. When the world changes — and it always does — the patterns in new data start diverging from the patterns the model learned. This is called drift, and it comes in several forms.
Data Drift
The distribution of input features changes over time. A customer churn model trained on pre-pandemic data encounters dramatically different usage patterns post-pandemic. An e-commerce recommendation model trained during normal trading sees completely different browsing patterns during a supply chain crisis.
Data drift is inevitable. The question isn’t whether it’ll happen, but how fast and how severely.
Concept Drift
The relationship between inputs and outputs changes. The same customer behaviors that previously predicted churn now predict loyalty (or vice versa). This is more insidious than data drift because the inputs might look normal while the underlying patterns shift.
Label Drift
The distribution of outcomes changes. If your fraud detection model was trained on a period where 2% of transactions were fraudulent, and that rate shifts to 5%, the model’s threshold settings become inappropriate even if it’s technically still detecting the same patterns.
What to Monitor
Input Data Quality
Before worrying about model performance, monitor the data coming in. This catches problems before they affect predictions.
Schema violations: Missing fields, unexpected data types, null values in required columns. These are immediate failure modes that should trigger alerts.
Distribution shifts: Track the statistical properties of each input feature — mean, variance, min, max, percentiles. When these deviate significantly from training data distributions, flag it. Tools like Evidently AI provide open-source drift detection specifically for this purpose.
Volume anomalies: Sudden spikes or drops in prediction request volume often indicate upstream problems — broken data pipelines, system outages, or user behavior changes.
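To make the distribution-shift check concrete, here is a minimal sketch in Python using a two-sample Kolmogorov-Smirnov test from SciPy rather than a dedicated tool like Evidently AI. The function name, stand-in data, and p-value threshold are illustrative assumptions, not a prescribed setup.

```python
# Minimal drift check for one numeric feature: compare a recent production
# window against the training baseline with a two-sample KS test.
# Thresholds and window sizes here are illustrative, not prescriptive.
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(baseline: np.ndarray, production: np.ndarray,
                        p_threshold: float = 0.01) -> dict:
    """Return simple drift diagnostics for a single feature."""
    # Guard against upstream data-quality problems before testing distributions.
    null_fraction = np.isnan(production).mean()
    stat, p_value = ks_2samp(baseline[~np.isnan(baseline)],
                             production[~np.isnan(production)])
    return {
        "null_fraction": float(null_fraction),
        "ks_statistic": float(stat),
        "p_value": float(p_value),
        "drift_flag": p_value < p_threshold,
    }

# Stand-in data: baseline from training, production from the last day of requests.
baseline = np.random.normal(50, 10, 5000)
production = np.random.normal(55, 12, 2000)
print(check_feature_drift(baseline, production))
```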
Model Performance Metrics
This seems obvious, but it’s harder than it sounds in production because you often don’t have ground truth labels immediately.
Feedback delay: In fraud detection, you may not know for months whether a flagged transaction was actually fraudulent. In recommendation systems, you know within seconds whether the user clicked. The feedback delay determines how quickly you can detect performance degradation.
Proxy metrics: When direct performance measurement is delayed, track proxy metrics that correlate with model quality. Prediction confidence scores, prediction distribution changes, and user interaction patterns can signal degradation before you have ground truth labels.
Segmented performance: Aggregate metrics can hide problems. A model that’s 95% accurate overall might be 50% accurate for a specific customer segment, product category, or geographic region. Monitor performance across relevant segments, not just in aggregate.
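As a sketch of segment-level monitoring, the snippet below slices accuracy by a segment column in a pandas prediction log. The column names and data are placeholders for whatever your own log contains.

```python
# Slice model accuracy by segment instead of relying on the aggregate number.
import pandas as pd

log = pd.DataFrame({
    "segment":    ["retail", "retail", "enterprise", "enterprise", "enterprise"],
    "label":      [1, 0, 1, 1, 0],
    "prediction": [1, 0, 0, 0, 0],
})

overall = (log["label"] == log["prediction"]).mean()
by_segment = (
    log.assign(correct=log["label"] == log["prediction"])
       .groupby("segment")["correct"]
       .agg(accuracy="mean", volume="size")
)

print(f"overall accuracy: {overall:.2f}")
print(by_segment)  # the enterprise segment is far below the aggregate figure
```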
System Performance
Model quality metrics won’t help if the model isn’t running properly.
Latency: Track prediction latency at p50, p95, and p99. A model that averages 50ms but occasionally takes 5 seconds is going to cause timeout errors and bad user experiences.
Error rates: Failed predictions, timeout rates, out-of-memory errors. These should be dashboarded and alerted on.
Resource utilisation: CPU, memory, GPU utilisation. Creeping resource usage often indicates data volume growth or model inefficiency.
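Computing those tail percentiles from logged latencies is straightforward; the sketch below assumes you already have latencies as a simple array, and the values are made up for illustration.

```python
# Compute p50/p95/p99 latency from logged prediction latencies (in ms).
import numpy as np

latencies_ms = np.array([42, 48, 51, 47, 55, 49, 4980, 50, 46, 52])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
# A healthy-looking average can hide tail latency: the 5-second outlier
# barely moves p50 but dominates p99.
```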
Setting Alert Thresholds
This is where most teams struggle. Set thresholds too tight and you get alert fatigue — your monitoring system cries wolf constantly and people start ignoring it. Set them too loose and you miss genuine degradation.
Statistical Process Control
Borrow from manufacturing quality control. Use control charts — track your metric over time, compute the mean and standard deviation from a stable baseline period, and set alert thresholds at 2 sigma (warning) and 3 sigma (critical).
This adapts to the natural variability in your metrics. A metric that varies a lot in normal operation gets wider bounds. A stable metric gets tighter bounds. It’s much more principled than picking arbitrary thresholds.
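Here is a minimal sketch of that control-chart approach: estimate the mean and standard deviation from a stable baseline window, then classify new observations against 2-sigma and 3-sigma bounds. The baseline window and the metric being monitored are assumptions you would tune to your own system.

```python
# Control-chart style thresholds: mean and standard deviation come from a
# stable baseline period; new values beyond 2 sigma warn, beyond 3 sigma alert.
import numpy as np

def control_limits(baseline: np.ndarray) -> dict:
    mu, sigma = baseline.mean(), baseline.std(ddof=1)
    return {
        "warn_low": mu - 2 * sigma, "warn_high": mu + 2 * sigma,
        "crit_low": mu - 3 * sigma, "crit_high": mu + 3 * sigma,
    }

def classify(value: float, limits: dict) -> str:
    if not (limits["crit_low"] <= value <= limits["crit_high"]):
        return "critical"
    if not (limits["warn_low"] <= value <= limits["warn_high"]):
        return "warning"
    return "ok"

# Baseline: daily accuracy over a stable period; then check today's value.
baseline_accuracy = np.array([0.94, 0.95, 0.93, 0.95, 0.94, 0.96, 0.94, 0.95])
limits = control_limits(baseline_accuracy)
print(classify(0.88, limits))  # flagged as critical relative to the baseline
```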
Page’s Cumulative Sum (CUSUM) Test
For detecting gradual drift rather than sudden changes, CUSUM tests are more sensitive than simple threshold alerts. They accumulate small deviations over time and trigger when the cumulative deviation exceeds a threshold. This catches slow degradation that individual data points would miss.
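A bare-bones, one-sided CUSUM might look like the sketch below. The target, slack (k), and decision threshold (h) are illustrative values that would normally be calibrated on baseline data.

```python
# One-sided CUSUM sketch for detecting a slow upward drift in a monitored
# metric (e.g. mean prediction error).
def cusum_alerts(values, target: float, k: float, h: float):
    """Yield indices where the cumulative positive deviation exceeds h."""
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate deviations above the target, minus a slack allowance k,
        # never letting the sum go negative.
        s = max(0.0, s + (x - target) - k)
        if s > h:
            yield i
            s = 0.0  # reset after an alert

# Each daily error creeps up by an amount no single-point threshold would catch.
daily_error = [0.10, 0.11, 0.10, 0.12, 0.12, 0.13, 0.13, 0.14, 0.15, 0.15]
print(list(cusum_alerts(daily_error, target=0.10, k=0.005, h=0.08)))  # [7, 9]
```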
Business Impact Thresholds
Sometimes the right threshold isn’t statistical — it’s business-driven. “Alert me when the false positive rate exceeds 5%” is a business decision, not a statistical one. Define what level of model degradation is acceptable in business terms, then translate that into metric thresholds.
Monitoring Architecture
A practical monitoring setup doesn’t need to be complex.
Logging: Log every prediction with its inputs, outputs, confidence scores, latency, and timestamp. This is your raw data for all downstream monitoring. Use structured logging — JSON format — so it’s easy to parse and query.
Metrics pipeline: Aggregate logged predictions into summary metrics at regular intervals (hourly, daily). Compute drift statistics, performance metrics, and system health indicators.
Dashboard: A single dashboard showing key health indicators. I’ve seen teams organise this into three tiers: system health (is the model serving requests?), data health (are inputs within expected ranges?), and model health (are predictions reasonable?).
Alerting: Automated alerts for critical issues (system failures, extreme drift) routed to on-call engineers. Summary digests for non-critical trends routed to the data science team for review.
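To make the logging step above concrete, here is one way to write a structured, JSON-per-line prediction record. The field names and helper function are assumptions for illustration, not a standard schema.

```python
# A minimal structured prediction log entry, written as one JSON object per
# line so downstream aggregation jobs can parse and query it easily.
import json
import time
import uuid

def log_prediction(features: dict, prediction, confidence: float,
                   latency_ms: float, model_version: str) -> str:
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "confidence": confidence,
        "latency_ms": latency_ms,
    }
    line = json.dumps(record)
    print(line)  # in production, write to a log file or log shipper instead
    return line

log_prediction(
    features={"tenure_months": 14, "monthly_spend": 82.5},
    prediction="churn",
    confidence=0.87,
    latency_ms=41.3,
    model_version="churn-model-2",
)
```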
Common Mistakes
Monitoring only aggregate accuracy. A model can maintain 95% overall accuracy while becoming completely unreliable for a specific segment. Always slice performance by relevant dimensions.
Not establishing baselines. You can’t detect degradation if you don’t know what normal looks like. Before deploying, establish baseline performance metrics on a holdout test set. Compare production performance against this baseline.
Treating monitoring as a one-time setup. Monitoring configurations need to evolve as the model and its environment change. Review alert thresholds quarterly. Add new monitoring dimensions as you discover new failure modes.
Ignoring the feedback loop. Monitoring tells you something is wrong. You also need processes for acting on that information — retraining triggers, rollback procedures, escalation paths. Monitoring without response procedures is just watching your model fail in real time.
The Minimum Viable Monitoring Stack
If you’re starting from nothing, implement these four things first:
- Log all predictions with inputs, outputs, and metadata
- Track input feature distributions and alert on statistical drift
- Monitor system metrics — latency, error rates, throughput
- Schedule regular performance reviews — even a monthly manual review of a sample of predictions against ground truth is better than nothing
You can build from there, but these four cover the most common failure modes and give you the data to diagnose problems when they occur. Model monitoring isn’t glamorous work, but it’s the difference between a model that delivers value for years and one that quietly becomes useless within months.