
Scaling Synthetic Data: From Thousands to Millions

Building infrastructure to generate millions of training images - the engineering of a synthetic data factory.

Evyatar Bluzer
3 min read

Our synthetic data efforts have proven the concept. Now we need to scale from prototype (thousands of images) to production (millions per training run).

Scale Requirements

Training a robust perception model needs:

  • Hand tracking: 5M+ images across poses, lighting, backgrounds
  • Eye tracking: 2M+ images across gaze directions, face shapes
  • Scene understanding: 10M+ images across environments, objects

At our current rate (1,000 images/day), the hand tracking dataset alone would take nearly 14 years.

The Scaling Challenge

Compute

Rendering bottleneck: ~5 seconds per image on high-end GPU.

Solutions:

  • Cloud burst: Spin up 1000 GPU instances for render jobs
  • Render optimization: Denoising lets us render with far fewer samples per pixel
  • Lower fidelity where okay: Not every image needs ray tracing

Target: 100,000 images/day sustained.
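A quick sanity check on these numbers (illustrative arithmetic only, using the ~5 s/image figure above): the sustained target is surprisingly modest in raw GPU-hours, which suggests the 1000-instance burst is for clearing backlogs quickly rather than meeting the daily rate.

```python
# Back-of-envelope throughput check. All figures are taken from the
# estimates above; treat them as illustrative, not measured.
SECONDS_PER_IMAGE = 5            # render time on a high-end GPU
TARGET_IMAGES_PER_DAY = 100_000  # sustained target

gpu_seconds_per_day = SECONDS_PER_IMAGE * TARGET_IMAGES_PER_DAY
gpus_needed = gpu_seconds_per_day / 86_400  # seconds in a day

print(f"Sustained GPUs needed: {gpus_needed:.1f}")
# → Sustained GPUs needed: 5.8
```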

Asset Pipeline

Diverse training data needs diverse assets:

  • 1000+ 3D environments
  • 10,000+ objects
  • 500+ hand textures/shapes
  • Unlimited procedural variations

Building asset pipelines with:

  • Automated acquisition from 3D repositories
  • Procedural variation of base assets
  • Quality validation gates
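The variation-plus-gate idea can be sketched in a few lines. Everything here (asset fields, the jitter range, the gate thresholds) is hypothetical, chosen only to show the shape of the pipeline:

```python
import random
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    poly_count: int
    has_uvs: bool

def vary(base: Asset, rng: random.Random) -> Asset:
    # Procedural variation: jitter a base asset into a new instance.
    scale = rng.uniform(0.8, 1.2)
    return Asset(name=f"{base.name}_v{rng.randrange(10_000)}",
                 poly_count=int(base.poly_count * scale),
                 has_uvs=base.has_uvs)

def passes_gate(asset: Asset) -> bool:
    # Quality gate: reject assets unlikely to render or label cleanly.
    # Thresholds are placeholders, not production values.
    return asset.has_uvs and 100 <= asset.poly_count <= 500_000

rng = random.Random(42)
base = Asset("mug", poly_count=12_000, has_uvs=True)
variants = [a for a in (vary(base, rng) for _ in range(100))
            if passes_gate(a)]
```

The key design point is that the gate runs on every generated variant, not just on the base assets, so a bad jitter can never reach the render farm.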

Variation Management

How do you sample from a trillion possible configurations?

Strategies:

  • Stratified sampling: Ensure coverage of known important factors
  • Curriculum sampling: Start uniform, then focus on failure cases
  • Active learning: Let the model tell you what it struggles with
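Stratified sampling is the easiest of the three to show concretely. A minimal sketch, with hypothetical variation axes (real pipelines have many more):

```python
import itertools
import random

# Hypothetical variation axes — placeholders for illustration.
AXES = {
    "lighting": ["indoor", "outdoor", "dim"],
    "background": ["office", "street", "plain"],
    "skin_tone": ["I", "II", "III", "IV", "V", "VI"],
}

def stratified_sample(n_per_cell: int, rng: random.Random):
    # Guarantee every combination of known-important factors appears,
    # instead of trusting uniform sampling to cover them all.
    for cell in itertools.product(*AXES.values()):
        for _ in range(n_per_cell):
            # Unlisted parameters get a fresh random seed per image.
            yield dict(zip(AXES.keys(), cell), seed=rng.randrange(2**32))

configs = list(stratified_sample(2, random.Random(0)))
# 3 lighting × 3 background × 6 skin tones × 2 per cell = 108 configs
```

With uniform sampling, rare cells (say, dim lighting plus skin tone VI) can end up nearly empty; stratification makes their coverage a guarantee rather than a probability.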

Metadata and Versioning

Every image needs:

  • Exact scene configuration (reproducibility)
  • Ground truth labels (automatic from render)
  • Variation parameters (for analysis)

Storing and tracking metadata for millions of images is a database engineering problem in its own right.
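One way to make those three requirements concrete is a single record whose content hash doubles as the image ID, so the same configuration and pipeline version always map to the same image. A sketch (field names are ours, not a real schema):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RenderRecord:
    scene_config: dict    # exact scene parameters (reproducibility)
    labels: dict          # ground truth emitted by the renderer
    variation: dict       # which axes were varied (for analysis)
    pipeline_version: str # ties the image to the code that made it

    def image_id(self) -> str:
        # Deterministic ID: identical config + version -> identical ID,
        # so re-renders dedupe and every image is traceable.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:16]
```

Content-addressing like this also makes versioning cheap: bump `pipeline_version` and every downstream image gets a new identity automatically.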

Infrastructure Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Orchestration Layer                       │
│  - Job scheduling                                            │
│  - Resource allocation                                       │
│  - Progress tracking                                         │
└─────────────────────────────────────────────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Asset Service  │  │  Render Farm    │  │  Data Lake      │
│  - Asset DB     │  │  - GPU cluster  │  │  - Image store  │
│  - Variation    │  │  - Job workers  │  │  - Metadata DB  │
│  - Validation   │  │  - Output queue │  │  - Versioning   │
└─────────────────┘  └─────────────────┘  └─────────────────┘
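At the orchestration layer, one useful pattern is priority-based scheduling: re-renders requested by active learning jump ahead of routine coverage jobs. A toy sketch of that idea (not our actual scheduler):

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class RenderJob:
    priority: int                       # lower number = more urgent
    scene_id: str = field(compare=False)

# The orchestration layer as a priority queue: failure-case re-renders
# (fed back from active learning) preempt routine coverage jobs.
queue: list[RenderJob] = []
heapq.heappush(queue, RenderJob(priority=10, scene_id="coverage_0001"))
heapq.heappush(queue, RenderJob(priority=1, scene_id="hard_case_0042"))

next_job = heapq.heappop(queue)
# next_job.scene_id == "hard_case_0042"
```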

Cost Management

Cloud GPU rendering is expensive:

  • 1000 GPUs × $2/hr × 24 hours = $48,000/day
  • 100M images = $4.8M in compute alone

Cost reduction strategies:

  • Spot instances: 70% savings for fault-tolerant workloads
  • Render optimization: 2x speed = 50% cost
  • Smart sampling: Better coverage with fewer images

Annual synthetic data budget: $2M. Must make it count.
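To see how the levers stack, here is a rough sketch. The base per-image cost is the one implied by the figures above ($4.8M / 100M images); the discount and speedup numbers are the headline estimates from the strategy list, not measurements:

```python
# Illustrative cost stacking — all inputs come from the estimates
# in this post, so treat the output as a rough order of magnitude.
base = 4_800_000 / 100_000_000  # ≈ $0.048 per image, implied above

spot = 0.30       # spot instances: pay ~30% of on-demand (70% savings)
speedup = 2.0     # render optimization: 2x speed = half the GPU time

cost_per_image = base * spot / speedup
annual_images = 2_000_000 / cost_per_image  # within the $2M budget

print(f"${cost_per_image:.4f}/image, ~{annual_images / 1e6:.0f}M images/yr")
# → $0.0072/image, ~278M images/yr
```

Note the strategies multiply rather than add, which is why combining even two of them changes what the $2M budget can buy.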

Validation

How do we know synthetic data is good?

  1. Visual inspection: Sample renders reviewed by humans
  2. Distribution analysis: Compare stats to real data
  3. Model performance: Ultimately, does training on it work?
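For the distribution-analysis step, a simple starting point is a two-sample Kolmogorov–Smirnov statistic on scalar image statistics (brightness, contrast, and so on). A self-contained sketch, with synthetic Gaussian data standing in for real measurements:

```python
import bisect
import random

def ks_statistic(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    # the two empirical CDFs. Large values flag a distribution mismatch.
    a, b = sorted(a), sorted(b)

    def cdf(xs, v):
        # Fraction of samples in xs that are <= v.
        return bisect.bisect_right(xs, v) / len(xs)

    return max(abs(cdf(a, v) - cdf(b, v)) for v in set(a) | set(b))

rng = random.Random(0)
# Stand-ins for, e.g., mean image brightness in [0, 1]:
real = [rng.gauss(0.5, 0.1) for _ in range(1000)]
synthetic_ok = [rng.gauss(0.5, 0.1) for _ in range(1000)]
synthetic_bad = [rng.gauss(0.8, 0.1) for _ in range(1000)]

assert ks_statistic(real, synthetic_ok) < ks_statistic(real, synthetic_bad)
```

In practice you would run a test like this per statistic, per batch, and fail the batch when the gap crosses a threshold; libraries such as SciPy provide this test (`scipy.stats.ks_2samp`) with proper p-values.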

We're building automated validation into the pipeline so that bad data never reaches training.
