Production ML Systems: Beyond Model Training

What it takes to run ML in production at scale - monitoring, versioning, deployment, and all the things that aren't training.

Evyatar Bluzer
3 min read

Training a good model is maybe 20% of production ML. The other 80% is everything else.

The Full ML System

Data Pipeline → Feature Engineering → Training → Evaluation →
Deployment → Monitoring → Feedback → Data Pipeline

Most ML research focuses on training. Production systems need everything else to work too.

Data Pipeline

In production:

  • Data arrives continuously (not static datasets)
  • Data quality varies (upstream bugs, schema changes)
  • Data volumes are large (can't fit in memory)
  • Freshness matters (stale training data = stale model)

Our system:

  • Streaming data ingestion
  • Automated quality checks (distribution monitoring)
  • Incremental dataset creation
  • Lineage tracking (which data trained which model)
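One of the cheapest automated quality checks is comparing an incoming batch's statistics against a trusted reference window. A minimal sketch (the helper name and z-score test are illustrative; real pipelines use richer tests such as PSI or Kolmogorov-Smirnov):

```python
from statistics import mean, stdev

def batch_passes_quality_gate(reference, incoming, z_threshold=3.0):
    """Flag an incoming batch whose mean drifts far from the reference.

    `reference` and `incoming` are lists of one numeric feature's values.
    The z-score of the incoming mean against the reference distribution
    is a crude but cheap drift signal. (Hypothetical helper.)
    """
    ref_mean = mean(reference)
    ref_std = stdev(reference) or 1e-9  # guard against zero variance
    z = abs(mean(incoming) - ref_mean) / ref_std
    return z <= z_threshold  # True = batch passes the gate
```

A batch that fails the gate gets quarantined rather than silently merged into the next training set.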

Feature Engineering

Features for VPS:

  • Image features (learned descriptors)
  • Geometric features (camera poses)
  • Context features (time, weather, device type)

Production concerns:

  • Feature computation must be identical in training and serving
  • Feature stores for consistency
  • Feature versioning (V1 features vs V2 features)

Training/serving skew — the same feature computed one way offline and a subtly different way online — is a top cause of production issues.
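The simplest defense is a single feature function that both the offline training pipeline and the online serving path import. A sketch with hypothetical field names:

```python
def compute_features(raw):
    """Single source of truth for feature computation.

    Training and serving both call this exact function, so any change
    affects both paths at once — removing one common source of
    training/serving skew. (Illustrative schema; fields are made up.)
    """
    return {
        "hour_of_day": raw["timestamp_s"] // 3600 % 24,
        "descriptor_norm": sum(x * x for x in raw["descriptor"]) ** 0.5,
        "is_mobile": raw["device_type"] in ("ios", "android"),
    }

# Training: features = [compute_features(r) for r in training_rows]
# Serving:  features = compute_features(request_payload)
```

A feature store generalizes this idea: the computation is registered once, and both batch and online consumers read the same definition.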

Training Infrastructure

At Meta scale:

  • Distributed training across hundreds of GPUs
  • Hyperparameter search across thousands of configs
  • Automatic retraining on new data
  • Experiment tracking and comparison

We retrain VPS models weekly with latest data.

Evaluation

Research evaluation: Benchmark accuracy. Production evaluation: Will this model improve the product?

Additional metrics:

  • Latency (is it fast enough?)
  • Memory (does it fit?)
  • Fairness (does it work equally across regions?)
  • Robustness (does it handle edge cases?)

Comprehensive eval gates prevent bad models from shipping.
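An eval gate can be as simple as a table of metric thresholds checked before promotion. A minimal sketch (metric names and limits are illustrative, not our actual gates):

```python
def eval_gate(metrics, thresholds):
    """Return the list of failed checks; an empty list means ship.

    `metrics` maps metric name -> measured value.
    `thresholds` maps metric name -> (comparator, limit),
    where comparator is ">=" (higher is better) or "<=" (lower is better).
    """
    failures = []
    for name, (op, limit) in thresholds.items():
        value = metrics[name]
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            failures.append(f"{name}: {value} violates {op} {limit}")
    return failures

gates = {
    "accuracy":       (">=", 0.90),
    "p99_latency_ms": ("<=", 50),
    "memory_mb":      ("<=", 512),
}
```

A candidate model that fails any check never reaches the rollout stage.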

Deployment

Gradual rollout:

  1. Shadow mode: New model runs alongside old, results compared
  2. Canary: 1% of traffic sees new model
  3. Incremental: 10%, 50%, 100% over days
  4. Holdback: Keep some users on old model for comparison

Any regression triggers automatic rollback.
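The canary and incremental stages are usually driven by deterministic hash bucketing, so a user's assignment is stable as the percentage grows. A sketch of the idea (function name is hypothetical; real systems also maintain a holdback group):

```python
import hashlib

def serves_new_model(user_id: str, rollout_pct: float) -> bool:
    """Deterministically assign a user to the new model.

    Hashing the user id into 10,000 buckets keeps assignment stable:
    a user who saw the new model at 1% keeps seeing it at 10%, 50%,
    and 100%, because the rollout only widens the bucket range.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000   # 0..9999
    return bucket < rollout_pct * 100      # e.g. 1.0% -> buckets 0..99
```

Because assignment is a pure function of the user id, shadow-mode comparisons and rollbacks don't reshuffle who sees what.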

Monitoring

In production, you can't manually check every prediction:

  • Prediction distribution monitoring (is output changing?)
  • Error rate monitoring (are we failing more?)
  • Latency monitoring (are we slowing down?)
  • Data drift detection (is input distribution changing?)

Alerts trigger investigation before users complain.
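Error-rate monitoring, for example, reduces to tracking failures over a sliding window and alerting past a threshold. A minimal sketch (class name and defaults are made up; production monitors also track latency percentiles and prediction distributions):

```python
from collections import deque

class ErrorRateMonitor:
    """Alert when the error rate over a sliding window crosses a threshold."""

    def __init__(self, window=1000, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # recent pass/fail outcomes
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one prediction outcome; return True if alerting."""
        self.outcomes.append(failed)
        error_rate = sum(self.outcomes) / len(self.outcomes)
        return error_rate > self.threshold
```

The sliding window matters: a burst of failures should alert quickly, while ancient history shouldn't mask a fresh regression.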

The Feedback Loop

User feedback (implicit and explicit) improves future models:

  • Localization success/failure → labels for hard examples
  • User corrections → ground truth for retraining
  • Engagement metrics → signal for what users value

Closing this loop is what enables continuous improvement.
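Concretely, closing the loop means turning raw feedback events into labeled examples for the next retraining run. A sketch of the mapping (the event schema and field names are hypothetical):

```python
def feedback_to_examples(events):
    """Convert localization feedback events into retraining examples.

    Failures become hard negatives worth oversampling; user corrections
    supply ground-truth poses; successes confirm the predicted pose.
    """
    examples = []
    for e in events:
        if e["outcome"] == "failure":
            examples.append({"query": e["query_id"], "label": None, "hard": True})
        elif e["outcome"] == "corrected":
            examples.append({"query": e["query_id"], "label": e["corrected_pose"], "hard": True})
        elif e["outcome"] == "success":
            examples.append({"query": e["query_id"], "label": e["predicted_pose"], "hard": False})
    return examples
```

Flagging hard examples lets the training pipeline oversample exactly the cases where the current model struggles.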

Lessons Learned

  1. Invest in infrastructure early: Automated retraining, deployment, and monitoring save more time than model tweaks.

  2. Reproducibility is essential: Must be able to recreate any historical model.

  3. Monitor everything: You can't fix what you can't see.

  4. Plan for failure: Models will misbehave. Fast detection and rollback limit damage.