Founding the Synthetic Data Team
Why I pushed to create a dedicated synthetic data team, and the case for simulation-first perception development.
After a year of building perception systems, one pattern keeps emerging: we're bottlenecked on data. Today I made the case for a dedicated synthetic data team. It got approved.
The Data Problem
Machine learning for perception needs:
- Scale: Millions of labeled examples
- Diversity: Every lighting condition, environment, user variation
- Accuracy: Pixel-perfect labels for segmentation, millimeter-perfect depth
- Edge cases: The rare scenarios where systems fail
Real-world data collection falls short on all four counts:
- Expensive and slow to collect
- Limited diversity (can only capture what exists)
- Labels are noisy (human annotation has errors)
- Edge cases are by definition rare
The Synthetic Data Promise
In simulation:
- Generate unlimited data programmatically
- Perfect ground truth by construction
- Full control over conditions
- Easy to synthesize rare events
The catch: synthetic data must be "real enough" to transfer to actual sensors.
Team Charter
The synthetic data team will:
- Build rendering infrastructure for sensor-accurate simulation
- Create asset pipelines for scalable environment generation
- Develop domain adaptation techniques to close the reality gap
- Establish validation protocols to ensure transfer
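One technique for closing the reality gap is domain randomization: render each scene with aggressively varied lighting, materials, and sensor noise, so that real imagery looks like just another sample from the training distribution. A minimal sketch of the idea - all parameter names and ranges here are illustrative, not from an actual pipeline:

```python
import random

# Hypothetical render-parameter ranges; real ranges would be tuned per sensor.
RANDOMIZATION_SPACE = {
    "light_intensity": (0.2, 3.0),       # relative to nominal exposure
    "light_temperature_k": (2700, 6500), # warm indoor to daylight
    "camera_noise_sigma": (0.0, 0.03),   # additive Gaussian, normalized pixels
    "texture_roughness": (0.1, 0.9),     # PBR material roughness
}

def sample_render_params(rng: random.Random) -> dict:
    """Sample one randomized scene configuration.

    Training across wide parameter ranges forces the model to become
    invariant to them, so the real world lands inside the distribution.
    """
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANDOMIZATION_SPACE.items()}

rng = random.Random(42)
batch = [sample_render_params(rng) for _ in range(4)]
```

In practice each sampled configuration drives one render job, and the randomization space grows to cover everything the perception model must ignore.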
Initial Focus Areas
Eye tracking: Rendering eye images in simulation with controlled eyelid positions, pupil sizes, and gaze directions. A relatively contained domain, which makes it a good starting point.
Depth sensors: Simulating time-of-flight (ToF) and structured-light sensors, including realistic sensor noise models.
RGB features: The hardest of the three - photorealistic rendering at scale.
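For the depth-sensor work, even a crude noise model captures the two effects that matter most for transfer: range-dependent noise and invalid-pixel dropout. A sketch of what such a model might look like - the constants are illustrative placeholders, and a production model would also handle multipath and motion artifacts:

```python
import numpy as np

def simulate_tof_noise(depth_m, rng, sigma_base=0.002, sigma_scale=0.005,
                       dropout_prob=0.02):
    """Corrupt ideal rendered depth with a simple ToF-style noise model.

    - Range noise grows with distance (returned light falls off with range).
    - Random dropout models pixels with no valid return.
    Constants are placeholders; a real model is fit to captured sensor data.
    """
    depth = np.asarray(depth_m, dtype=np.float64)
    sigma = sigma_base + sigma_scale * depth**2  # noise grows with range
    noisy = depth + rng.normal(0.0, sigma)       # per-pixel Gaussian noise
    invalid = rng.random(depth.shape) < dropout_prob
    noisy[invalid] = 0.0                         # 0 = no return, a common convention
    return noisy

rng = np.random.default_rng(0)
ideal = np.full((64, 64), 2.0)  # a flat wall 2 m away
noisy = simulate_tof_noise(ideal, rng)
```

The payoff is that the same rendered scene yields both the perfect ground-truth depth and a sensor-realistic input, which is exactly the pairing real capture can never provide.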
Hiring Profile
The profile differs from typical CV engineers:
- Graphics expertise: Real-time rendering, PBR, ray tracing
- Procedural generation: Creating variation programmatically
- Domain knowledge: Understanding what perception algorithms need
- Tooling mindset: Building systems others can use
Found our first hire - a graphics engineer from gaming who's excited about the applied ML angle.
Success Metrics
How do we know synthetic data is working?
- Gap measurement: The difference in performance on a real test set between models trained on synthetic data and models trained on real data
- Marginal value: Does adding more synthetic data keep improving performance?
- Coverage: Are we reaching scenarios impossible to capture in real data?
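The gap measurement amounts to a tiny evaluation harness: train one model on real data, one on synthetic, and score both on the same held-out real test set. A sketch with stub names - `eval_fn` and the model handles are placeholders for whatever training and evaluation stack the team builds:

```python
def sim_to_real_gap(eval_fn, model_real, model_syn, real_test_set):
    """Compare two models on the same real-world test set.

    eval_fn(model, dataset) -> accuracy in [0, 1]. A gap near zero means
    synthetic training transfers well; a large gap flags rendering or
    noise-model problems to fix before scaling up generation.
    """
    acc_real = eval_fn(model_real, real_test_set)
    acc_syn = eval_fn(model_syn, real_test_set)
    return acc_real - acc_syn

# Toy usage with stubbed-out scores standing in for real evaluations.
scores = {"real": 0.91, "syn": 0.86}
gap = sim_to_real_gap(lambda m, d: scores[m], "real", "syn", None)
# gap of about 0.05: the synthetic-trained model trails by 5 points
```

Tracking this number per release of the rendering pipeline turns "is the synthetic data good enough?" into a measurable regression test.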
Goal: by the end of next year, models trained with synthetic data outperform models trained on real data alone.
This is a bet. But the alternative - scaling real data collection indefinitely - is not sustainable.