MLOps Monitoring: What to Track When Models Go to Production


The first machine learning model I deployed to production worked beautifully for three weeks, then quietly started producing nonsense. We didn’t notice for five days because we were only monitoring server uptime and response times. The model was responding quickly with complete garbage.

That expensive lesson taught me that MLOps monitoring requires a fundamentally different approach than traditional software monitoring. Here’s the framework I’ve developed for keeping production ML systems healthy.

The Four Pillars of ML Monitoring

I organize ML monitoring into four categories: data quality, model performance, system health, and business impact. Each requires different tools and strategies.

1. Data Quality Monitoring

Most ML failures I’ve encountered trace back to data issues. Models trained on one distribution fail when production data drifts.

Distribution monitoring: I track the statistical distribution of input features over time. Significant shifts in mean, variance, or percentiles indicate distribution drift. For categorical features, I monitor the frequency of each category.

Tools like Evidently AI and WhyLabs make this relatively straightforward, but I’ve also built custom monitoring using simple statistical tests. A two-sample Kolmogorov-Smirnov test comparing recent production data to the training distribution can catch drift early.
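If you want to roll this yourself, a minimal sketch with SciPy looks something like this (the arrays below are stand-ins for real feature values, and the 0.01 significance level is a starting point, not a recommendation):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_numeric_drift(train_values, prod_values, alpha=0.01):
    """Two-sample KS test: flag drift when the two distributions differ significantly."""
    statistic, p_value = ks_2samp(train_values, prod_values)
    return {"statistic": statistic, "p_value": p_value, "drifted": p_value < alpha}

# Stand-in data: compare the training baseline against a recent production window.
baseline = np.random.normal(loc=50, scale=10, size=10_000)
recent = np.random.normal(loc=55, scale=10, size=2_000)
print(detect_numeric_drift(baseline, recent))
```

Run this per numeric feature on a schedule and you have a basic drift detector. With very large samples the test becomes oversensitive, so pair the p-value with a minimum effect size before alerting.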

Null and missing values: Production data is messier than training data. I track the rate of null values per feature. A sudden increase often indicates upstream data pipeline issues.
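A lightweight version of this check, assuming batches arrive as pandas DataFrames and you stored per-feature null rates from the training set:

```python
import pandas as pd

def null_rate_alerts(batch: pd.DataFrame, baseline_null_rates: dict, tolerance: float = 0.05):
    """Flag features whose null rate exceeds the training-time baseline by more than `tolerance`."""
    current = batch.isna().mean()
    return {
        feature: {"baseline": baseline_null_rates.get(feature, 0.0), "current": float(current[feature])}
        for feature in batch.columns
        if current[feature] > baseline_null_rates.get(feature, 0.0) + tolerance
    }
```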

Feature correlation changes: In one project, two input features that were uncorrelated during training became highly correlated in production due to a change in data collection. This didn’t break the model immediately but degraded performance gradually. Now I monitor key feature correlations.
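Here is the kind of check I mean, as a sketch: `baseline_corr` would be the correlation matrix computed on training data, and `feature_pairs` the handful of pairs you actually care about.

```python
import pandas as pd

def correlation_drift(batch: pd.DataFrame, baseline_corr: pd.DataFrame, feature_pairs, threshold=0.3):
    """Flag feature pairs whose correlation has moved far from the training baseline."""
    current_corr = batch.corr(numeric_only=True)
    flagged = {}
    for a, b in feature_pairs:
        delta = abs(current_corr.loc[a, b] - baseline_corr.loc[a, b])
        if delta > threshold:
            flagged[(a, b)] = {
                "baseline": baseline_corr.loc[a, b],
                "current": current_corr.loc[a, b],
                "delta": delta,
            }
    return flagged
```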

Out-of-range values: I define expected ranges for numerical features and alert when values fall outside. This catches data collection bugs quickly. For example, a temperature sensor in a manufacturing setting sending negative Kelvin values is clearly broken.
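This one barely needs a library; a sketch with hypothetical feature names and ranges:

```python
# Hypothetical ranges; derive yours from training data and domain knowledge.
EXPECTED_RANGES = {
    "temperature_kelvin": (0.0, 2000.0),
    "pressure_kpa": (50.0, 500.0),
}

def out_of_range_features(record: dict) -> list:
    """Return the features in a single record whose values fall outside their expected range."""
    violations = []
    for feature, (low, high) in EXPECTED_RANGES.items():
        value = record.get(feature)
        if value is not None and not (low <= value <= high):
            violations.append(feature)
    return violations
```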

2. Model Performance Monitoring

This is the obvious one, but implementing it effectively is trickier than it seems.

Ground truth delay problem: Many ML applications don’t get immediate feedback. A credit risk model might not learn for months whether a prediction was correct. A recommendation system never learns whether an item it didn’t recommend would have been clicked.

I handle this with proxy metrics and sampled ground truth collection:

Proxy metrics: Find correlates of performance that you can measure quickly. For a churn prediction model, I tracked the rate of high-confidence predictions (predicted probability above 90%). When this dropped, it usually indicated distribution drift even before we had ground truth.
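The metric itself is one line; assuming you log the model’s top-class probability for every request, something like:

```python
import numpy as np

def high_confidence_rate(top_class_probabilities, threshold: float = 0.9) -> float:
    """Fraction of recent predictions made with probability above `threshold`.
    A sustained drop in this rate is a cheap early-warning signal, well before labels arrive."""
    return float(np.mean(np.asarray(top_class_probabilities) >= threshold))
```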

Sampled ground truth: For expensive-to-obtain labels, I sample a subset of predictions for manual labeling. This gives directional performance tracking without labeling every prediction.
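One way to implement the sampling, sketched with a hash so the decision is deterministic per prediction and survives retries (the 2% rate is a placeholder):

```python
import hashlib

def sample_for_labeling(prediction_id: str, sample_rate: float = 0.02) -> bool:
    """Decide whether to send this prediction to the manual labeling queue."""
    digest = int(hashlib.md5(prediction_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < sample_rate * 10_000
```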

Prediction distribution monitoring: I track the distribution of predicted values. For a classification model, a sudden shift in the ratio of predicted classes often indicates problems. I once caught a bug where a feature was accidentally scaled differently in production, causing the model to predict one class >95% of the time.
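A simple way to quantify that kind of shift, assuming you know the class mix from training or from a stable reference period (`baseline_proportions` must sum to 1):

```python
import numpy as np
from scipy.stats import chisquare

def class_ratio_shift(predicted_classes, baseline_proportions: dict, alpha: float = 0.01):
    """Chi-square goodness-of-fit of recent predicted-class counts against the baseline mix."""
    classes = sorted(baseline_proportions)
    predicted = np.asarray(predicted_classes)
    counts = np.array([np.sum(predicted == c) for c in classes])
    expected = np.array([baseline_proportions[c] for c in classes]) * counts.sum()
    statistic, p_value = chisquare(counts, f_exp=expected)
    return {"statistic": statistic, "p_value": p_value, "shifted": p_value < alpha}
```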

3. System Health Monitoring

ML systems have unique system health requirements beyond standard application monitoring.

Latency percentiles: I track p50, p95, and p99 latency separately. For real-time inference, tail latency matters enormously. A model that responds in 50ms on average but takes 5 seconds for the p99 case creates poor user experience.

Resource utilization patterns: ML inference can have spiky resource usage. I monitor CPU, memory, and GPU utilization (if applicable) and alert on unusual patterns. A gradual increase in memory usage over time often points to a leak triggered by certain input patterns.

Model version tracking: In systems serving multiple model versions (during gradual rollouts), I tag all metrics with model version. This makes it possible to identify if a specific version is causing issues.
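With Prometheus this is mostly a matter of adding a label; a sketch using the prometheus_client library, where `model.predict` stands in for whatever your serving code actually calls and port 8000 is arbitrary:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Label every metric with the model version so a misbehaving canary shows up immediately.
PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds", "Prediction latency", ["model_version"]
)
PREDICTIONS_TOTAL = Counter(
    "model_predictions_total", "Predictions served", ["model_version", "predicted_class"]
)

def predict_and_record(model, features, model_version: str):
    with PREDICTION_LATENCY.labels(model_version=model_version).time():
        prediction = model.predict(features)
    PREDICTIONS_TOTAL.labels(model_version=model_version, predicted_class=str(prediction)).inc()
    return prediction

start_http_server(8000)  # expose /metrics for Prometheus to scrape
```

The same latency histogram then gives you per-version p50/p95/p99 breakdowns via PromQL's histogram_quantile.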

Dependency health: ML models often depend on feature stores, preprocessing pipelines, or external data sources. I monitor the health of these dependencies separately. A healthy model serving unhealthy data is worse than no model at all.

4. Business Impact Monitoring

Technical metrics matter, but business metrics matter more. I learned this after optimizing a model’s AUC while accidentally reducing revenue.

North star metric alignment: Every ML system should tie to a business metric. For a recommendation system, it might be click-through rate or revenue. I track this metric split by model version, timeframe, and key segments.

Segment-level performance: Aggregate metrics can hide serious issues. A model might perform well overall but terribly for a specific customer segment or product category. I monitor performance broken down by business-relevant segments.
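With the sampled ground truth from earlier, the breakdown is a short groupby; `customer_tier` here is just an example segment column:

```python
import pandas as pd

def segment_accuracy(labeled: pd.DataFrame, segment_col: str) -> pd.Series:
    """Accuracy per segment, assuming 'y_true' and 'y_pred' columns on the labeled sample."""
    correct = labeled["y_true"] == labeled["y_pred"]
    return correct.groupby(labeled[segment_col]).mean().sort_values()

# Example: segment_accuracy(labeled_sample, "customer_tier")
```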

Canary metrics: I identify leading indicators that predict business impact. For a pricing model, I tracked the variance in predicted prices. High variance often preceded customer complaints, giving us early warning.

Alerting Strategy

Having metrics is useless without actionable alerts. But alert fatigue is real. My approach:

Tiered alerting:

  • P0 alerts: Immediate response required (model serving errors, catastrophic performance drop)
  • P1 alerts: Investigate within hours (significant drift, performance degradation)
  • P2 alerts: Investigate within days (gradual drift, minor performance changes)

Contextual thresholds: Static thresholds create false positives. I use dynamic thresholds based on historical patterns. A 5% drop in accuracy might be normal Monday-to-Tuesday variance but concerning over a week.
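In its simplest form, a dynamic threshold is just a rolling baseline plus a tolerance; a sketch comparing today’s value against the same weekday over recent weeks:

```python
import numpy as np

def breaches_dynamic_threshold(history, current_value: float, n_sigma: float = 3.0) -> bool:
    """Alert when the current value falls more than n_sigma standard deviations below
    what recent history suggests (e.g. the last eight Mondays for a Monday reading)."""
    history = np.asarray(history, dtype=float)
    mean, std = history.mean(), history.std()
    return current_value < mean - n_sigma * max(std, 1e-9)
```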

Alert bundling: Related alerts fire together (e.g., data drift + performance drop). I bundle these to reduce noise and make root cause investigation easier.

The Retraining Decision Framework

Monitoring tells you when something is wrong. Deciding when to retrain is a separate question; the sketch after the lists below pulls my rules together.

I retrain when:

  • Performance drops below acceptable thresholds on sampled ground truth
  • Significant data drift persists for more than X days (X varies by model)
  • Business metrics show sustained degradation

I don’t automatically retrain on:

  • Minor drift that hasn’t impacted performance
  • Short-term anomalies (holidays, one-off events)
  • Issues traced to data quality problems (fix the data pipeline first)
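Put together, these rules fit in a small decision helper. A minimal sketch with placeholder fields and thresholds you would tune per model:

```python
from dataclasses import dataclass

@dataclass
class ModelStatus:
    sampled_accuracy: float         # from sampled ground truth
    accuracy_floor: float           # minimum acceptable accuracy for this model
    drift_days: int                 # consecutive days with significant drift
    max_drift_days: int             # the "X days" tolerance, varies by model
    business_metric_degraded: bool  # sustained degradation in the north star metric
    data_pipeline_issue: bool       # known upstream data quality problem

def should_retrain(status: ModelStatus) -> bool:
    """Encode the retraining rules above; data quality problems block retraining."""
    if status.data_pipeline_issue:
        return False  # fix the pipeline first, then reassess
    return (
        status.sampled_accuracy < status.accuracy_floor
        or status.drift_days > status.max_drift_days
        or status.business_metric_degraded
    )
```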

Tools and Implementation

I’ve used various tools across projects:

Open source: MLflow for experiment tracking and model registry, Evidently AI for drift detection, Prometheus + Grafana for metrics and dashboards.

Commercial: Datadog for unified monitoring, Arize AI for ML-specific observability, AWS SageMaker Model Monitor for AWS-hosted models.

Custom: For unique requirements, I’ve built custom monitoring using Python, statistical tests, and time-series databases (InfluxDB or TimescaleDB).

The choice depends on budget, team size, and specific requirements. I generally start simple with open-source tools and add specialized solutions as needs become clear.

Real-World Example

A fraud detection model I worked on experienced gradual performance degradation over two months. Here’s how monitoring helped:

  1. Data drift alerts triggered first, showing shift in transaction amount distribution
  2. Feature correlation monitoring revealed that merchant category codes were correlating differently with transaction amounts
  3. Prediction distribution shifted toward higher fraud probability predictions
  4. Business metrics showed increasing false positive rate, annoying customers

Root cause: a new payment processor used a different merchant category code mapping. Solution: retrained on recent data and added a preprocessing step to normalize merchant codes.

Without comprehensive monitoring, we would have noticed only when customer complaints spiked, weeks later.

The Human Element

Automated monitoring is critical, but human judgment is irreplaceable. I schedule regular model review sessions where the team looks at:

  • Borderline predictions (low confidence cases)
  • Recent errors on sampled ground truth
  • Unusual patterns in monitored metrics

These sessions often catch subtle issues automated monitoring misses and build team intuition about model behavior.

Scaling Considerations

Monitoring one model is straightforward. Monitoring dozens or hundreds requires different approaches.

For organizations with many models, I recommend:

  • Standardized monitoring templates per model type
  • Centralized monitoring dashboards with model-specific drill-down
  • Automated health checks that run across all models
  • Clear ownership: each model has a designated team responsible for monitoring and maintenance

Teams that specialize in building AI agents and other production ML systems work on exactly this kind of infrastructure, because scaling AI systems requires treating them as critical infrastructure, not experimental projects.

Continuous Improvement

Monitoring strategy should evolve with your understanding of model behavior. I review and update monitoring:

  • After any production incident (what would have caught it earlier?)
  • Quarterly, to add metrics for new patterns we’ve observed
  • When deploying new model versions with different characteristics

The goal is a monitoring system that catches issues before users do, provides clear diagnostic information, and supports confident decision-making about model maintenance.

Effective MLOps monitoring transforms ML from a risky experiment into a reliable system component. It’s not exciting work, but it’s what separates prototype from production.