Model Versioning in Production MLOps: Beyond Git and DVC


Software engineers have mature version control practices: Git tracks code changes, semantic versioning indicates breaking changes, and release tags mark production deployments. ML model versioning needs to achieve similar goals but faces fundamentally different challenges.

A model isn’t just code. It’s the combination of training data, training code, hyperparameters, random seeds, dependency versions, and the resulting weights. Change any of these inputs, and you get a different model. To reproduce a model or understand why performance changed, you need to version all of these components together.

Git alone doesn’t solve this. Most ML practitioners have experienced the “wait, which version of the model was that?” problem when trying to reproduce results or debug production issues. Here’s what we’ve learned about versioning models properly in production systems.

What Needs to Be Versioned

Model artifacts: The trained weights/parameters, typically serialized as .pkl, .h5, .pt, .safetensors, or ONNX files. These are what you deploy to production, but they’re meaningless without context about how they were created.

Training data: At minimum, a hash or reference to the exact dataset version used for training. Ideally, the full dataset or a reproducible pipeline to generate it. Data drift is real—models trained on January data perform differently than models trained on March data, even with identical code.

Training code: The scripts, notebooks, or pipelines that produced the model. This includes preprocessing code, model architecture definitions, and training loops.

Hyperparameters: Learning rate, batch size, number of epochs, regularization parameters, optimizer choice—everything that configured the training run. These need to be captured automatically, not manually recorded.

Dependencies: The exact versions of libraries used (PyTorch 2.1.0 vs 2.2.0 can produce different results). A requirements.txt or Pipfile.lock pinned to specific versions.

Evaluation metrics: Validation accuracy, loss curves, confusion matrices, or whatever metrics define model quality. You need to know not just that model v3 exists, but whether it’s better than model v2.

Environment information: Hardware used (GPU type affects numerical precision in some cases), random seeds (for reproducibility), and any other environmental factors that affect training.
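
Pulled together, these components form a per-run record. Here’s a minimal sketch of such a manifest, assuming nothing beyond the Python standard library; the field names and the file_sha256 helper are illustrative, not taken from any particular tool:

import hashlib
import json
import platform
import subprocess
from datetime import datetime, timezone

def file_sha256(path):
    """Hash a file so the manifest pins the exact dataset and artifact bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

manifest = {
    "created_at": datetime.now(timezone.utc).isoformat(),
    "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
    "training_data_sha256": file_sha256("data/training_data.csv"),
    "model_sha256": file_sha256("models/model.pkl"),
    "hyperparameters": {"learning_rate": 0.01, "batch_size": 32, "epochs": 10},
    "metrics": {"accuracy": 0.94, "f1": 0.92},
    "dependencies": open("requirements.txt").read().splitlines(),
    "environment": {"python": platform.python_version(), "random_seed": 42},
}

with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)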

Tracking all of this manually via Git commits and documentation doesn’t scale. You need tooling.

DVC: Data Version Control

DVC extends Git to handle large files (datasets, model artifacts) that don’t belong in Git repositories. It stores large files in remote storage (S3, GCS, Azure Blob) and tracks lightweight pointers in Git.

The workflow:

dvc add data/training_data.csv  # Track dataset
dvc add models/model.pkl        # Track model artifact
git add data/training_data.csv.dvc models/model.pkl.dvc
git commit -m "Training run v3"
dvc push                        # Upload files to remote storage

When someone else checks out that commit, they run dvc pull to download the exact data and model files associated with that code version.

Where DVC works well:

  • Versioning large datasets alongside code
  • Ensuring data and model artifacts are tied to specific Git commits
  • Simple reproducibility for individual training runs

Where DVC falls short:

  • No built-in experiment tracking (which hyperparameters produced which results)
  • No query interface (finding “the best model trained on dataset X”)
  • Manual workflow (you must remember to dvc add after every training run)
  • Doesn’t capture runtime information (metrics, logs, hardware used)

DVC solves part of the problem but isn’t a complete MLOps versioning solution by itself.

MLflow Model Registry

MLflow provides a model registry specifically designed for managing the ML model lifecycle. It tracks:

  • Model artifacts (weights/parameters)
  • Associated metadata (hyperparameters, metrics, Git commit hash)
  • Model versions with semantic stages (Staging, Production, Archived)
  • Lineage (which dataset/code version produced this model)

The workflow:

import mlflow

with mlflow.start_run():
    # Train model (train_model, X_train, y_train are placeholders for your own pipeline code)
    model = train_model(X_train, y_train)

    # Log parameters, metrics, and model
    mlflow.log_params({"learning_rate": 0.01, "batch_size": 32})
    mlflow.log_metrics({"accuracy": 0.94, "f1": 0.92})
    mlflow.sklearn.log_model(model, "model")

Models logged with a registered model name (or registered afterward with mlflow.register_model) are versioned automatically in the MLflow registry. You can query for models by metric (“show me all models with accuracy > 0.90”), promote models between stages (Staging → Production), and track which model version is deployed where.
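
As a rough sketch of those operations with the MLflow client (the experiment ID and registered model name are hypothetical, and newer MLflow releases favor aliases over the stage API shown here):

import mlflow
from mlflow.tracking import MlflowClient

# Find runs that clear an accuracy bar, best first
candidates = mlflow.search_runs(
    experiment_ids=["1"],                      # hypothetical experiment ID
    filter_string="metrics.accuracy > 0.90",
    order_by=["metrics.accuracy DESC"],
)

# Promote a registered model version from Staging to Production
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier",                   # hypothetical registered model name
    version=3,
    stage="Production",
)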

Where MLflow works well:

  • Central registry for trained models across teams
  • Comparing experiments and finding best-performing models
  • Managing model promotion (dev → staging → production)
  • Integrating with deployment tools (many serving platforms can pull directly from MLflow)

Where MLflow has limitations:

  • Versioning large datasets still requires external solutions (MLflow can track dataset URLs/hashes but doesn’t store the data itself)
  • Complex multi-stage ML pipelines (feature engineering → training → post-processing) aren’t natively represented
  • The UI can be slow with thousands of experiments

MLflow is the de facto standard for model versioning in production systems. Most teams running MLOps at scale use MLflow or a competing product (Weights & Biases, Neptune, Comet) that provides similar functionality.

Model Cards and Documentation

Technical versioning (tracking artifacts and metadata) is necessary but not sufficient. People need to understand what a model does, what data it was trained on, what its limitations are, and how it should (and shouldn’t) be used.

Model Cards (proposed by Google researchers in 2019) provide a structured documentation format that includes:

  • Model details (architecture, training procedure)
  • Intended use and limitations
  • Training data characteristics and potential biases
  • Evaluation results across different demographic subgroups
  • Ethical considerations

Model cards sit alongside technical version metadata. When you version model v12, you also update the model card to reflect changes in training data, evaluation results, or known issues.

Implementing model cards manually is tedious. Some teams automate parts of this—generating evaluation reports and bias analysis, then combining them with human-written sections on intended use and limitations.
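
One lightweight way to do that is to render the machine-generated sections from run metadata and leave explicit placeholders for the human-written parts. A minimal sketch (the structure and names are illustrative, not the Model Cards specification):

def render_model_card(name, version, metrics, training_data_note):
    """Build a Markdown model card: metrics come from the run, prose comes from people."""
    lines = [
        f"# Model Card: {name} v{version}",
        "",
        "## Training data",
        training_data_note,
        "",
        "## Evaluation results",
    ]
    lines += [f"- {metric}: {value}" for metric, value in metrics.items()]
    lines += [
        "",
        "## Intended use and limitations",
        "_TODO: written and reviewed by a human before promotion._",
    ]
    return "\n".join(lines)

card = render_model_card(
    name="churn-classifier",                   # hypothetical model name
    version=12,
    metrics={"accuracy": 0.94, "f1": 0.92},
    training_data_note="See the dataset version referenced in the model registry.",
)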

Semantic Versioning for Models

Software uses semantic versioning: MAJOR.MINOR.PATCH (e.g., 2.3.1). A MAJOR bump signals breaking changes, a MINOR bump adds backwards-compatible functionality, and a PATCH bump fixes bugs.

Can we apply similar logic to models?

Proposed model semantic versioning:

  • MAJOR: Change in model architecture, completely different training approach, or change that requires updated serving infrastructure. Example: switching from a scikit-learn model to a PyTorch neural network. Deployment code needs to change.

  • MINOR: Same architecture, retrained on new data, or hyperparameter tuning that improves performance. Input/output interface remains the same. Deployment code doesn’t need to change, but predictions will differ.

  • PATCH: Bug fixes in preprocessing, evaluation metric corrections, or documentation updates. The model artifact itself doesn’t change, or changes are minimal and provably non-breaking.

This isn’t a universal standard yet, but some teams implement it to communicate the impact of model updates: “Upgrading from model 2.3 to model 2.4 is safe (minor version). Upgrading from 2.9 to 3.0 requires code changes (major version).”
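
If you adopt the convention, deployment tooling can act on it mechanically. A small sketch of the kind of check a deployment pipeline might run (the rule comes from the convention above, not from any standard library):

def is_safe_upgrade(current, candidate):
    """Under the model-semver convention above, only a MAJOR bump requires serving-code changes."""
    return int(candidate.split(".")[0]) == int(current.split(".")[0])

assert is_safe_upgrade("2.3.0", "2.4.0")        # minor bump: redeploy without code changes
assert not is_safe_upgrade("2.9.0", "3.0.0")    # major bump: update serving code first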

Lineage Tracking and DAGs

Complex ML systems have multi-stage pipelines: raw data → cleaning → feature engineering → training → evaluation → deployment. Each stage produces artifacts that feed the next stage.

Tracking lineage—which dataset version produced which features, which features trained which model—requires DAG (directed acyclic graph) tracking.

Tools like Kubeflow Pipelines and Metaflow represent ML workflows as DAGs where each node is a versioned step with inputs and outputs. When you query “how was model v12 created?”, the system shows the full lineage from raw data through intermediate transformations to final model.

This is essential for debugging (“validation accuracy dropped—was it a data quality issue or a training code bug?”) and auditability (regulatory requirements in finance/healthcare often require full reproducibility of model training).
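
For illustration, here’s roughly what such a pipeline looks like as a Metaflow flow. The step bodies are placeholders, but each step’s inputs and outputs are versioned per run, which is what makes the “how was model v12 created?” query answerable:

from metaflow import FlowSpec, step

def build_features(dataset_version):
    # placeholder for real feature-engineering code
    return [[0.0, 1.0], [1.0, 0.0]]

def fit_model(features):
    # placeholder for real training code
    return {"weights": features}

class TrainingFlow(FlowSpec):
    """Raw data -> features -> model; Metaflow records each step's artifacts per run."""

    @step
    def start(self):
        self.dataset_version = "2024-03-01"    # hypothetical dataset tag
        self.next(self.featurize)

    @step
    def featurize(self):
        self.features = build_features(self.dataset_version)
        self.next(self.train)

    @step
    def train(self):
        self.model = fit_model(self.features)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainingFlow()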

Immutable Model Artifacts

Once a model is promoted to production, the artifact should be immutable. Model v5 in production today should produce identical predictions six months from now when you need to debug a past decision or reproduce results.

This requires:

  • Immutable storage (S3 versioning, GCS object versioning) so artifacts can’t be overwritten
  • Pinned dependencies (containerization or packaged environments that capture all libraries)
  • Archived training data (you must be able to access the exact data used for training, even years later)

Some teams use content-addressable storage: the model artifact’s hash becomes its identifier. If the artifact changes in any way, it gets a new hash/identifier. This makes tampering or accidental modification immediately obvious.
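
A minimal sketch of content addressing for a model artifact (the object-store prefix is hypothetical):

import hashlib
from pathlib import Path

artifact = Path("models/model.pkl")
digest = hashlib.sha256(artifact.read_bytes()).hexdigest()

# The content hash, not a mutable filename, becomes the storage key;
# re-uploading a modified artifact produces a different key instead of overwriting.
object_key = f"model-artifacts/{digest}.pkl"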

Practical Workflow for Production

Here’s a production-ready versioning workflow combining the tools above:

  1. Development: Experiments tracked in MLflow. Every training run logs code version (Git commit hash), hyperparameters, datasets (DVC reference), and metrics.

  2. Dataset versioning: Training data tracked with DVC (or similar). Data pipelines produce versioned outputs with checksums.

  3. Model registration: Successful models logged to MLflow Model Registry with “Staging” status. Includes model card documentation.

  4. Evaluation: Offline evaluation against a held-out test set, plus bias analysis. Results are attached to the model version in MLflow.

  5. Promotion: Model promoted to “Production” stage in MLflow after passing evaluation. This triggers deployment pipeline.

  6. Deployment: Model artifact and dependencies packaged as immutable container image. Deployed to serving infrastructure. Deployment records which model version is running in which environment.

  7. Monitoring: Online metrics (latency, throughput, prediction distribution) tracked against model version. If performance degrades, lineage tracking lets you quickly identify whether it’s a data drift issue or model issue.

  8. Rollback: If model v6 in production has issues, revert to model v5 (which is immutable and still available). Deployment system pulls the previous version from MLflow and redeploys.
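
For step 8, the rollback itself can be as simple as re-resolving a registry URI. A sketch assuming an MLflow registry (model name and version number are hypothetical):

import mlflow.pyfunc

# Model v6 is misbehaving; fetch the still-immutable v5 from the registry
rollback_uri = "models:/churn-classifier/5"    # hypothetical registered model name
model = mlflow.pyfunc.load_model(rollback_uri)

# The serving layer then repackages and redeploys this artifact, and the
# deployment record is updated to show that version 5 is live again.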

This workflow isn’t simple, but it provides the reproducibility, auditability, and reliability that production ML systems require.

What We Recommend

For teams building production ML systems:

  • Use MLflow or a similar model registry as your central source of truth for model versions
  • Version training data with DVC or a data versioning tool integrated with your data warehouse
  • Track lineage for multi-stage pipelines with a workflow orchestrator (Kubeflow, Metaflow, Airflow with ML extensions)
  • Implement model cards for documentation
  • Make model artifacts immutable once deployed to production
  • Monitor deployed model versions and maintain the ability to roll back quickly

Model versioning is more complex than code versioning, but the investment in proper tooling pays off the first time you need to reproduce a training run, debug a production issue, or explain to regulators exactly how a deployed model was created.