Building a Perception Testing Framework
Systematic testing infrastructure for perception systems - from unit tests to integration tests to field validation.
Perception systems are notoriously hard to test. Inputs are high-dimensional (images, depth maps), outputs are continuous (poses, keypoints), and "correct" often means "good enough." Here's how we built a testing framework.
Testing Pyramid for Perception
            /\
           /  \  Field Validation
          /    \ (real users, real environments)
         /──────\
        /        \  System Tests
       /          \ (full pipeline, recorded data)
      /────────────\
     /              \  Integration Tests
    /                \ (component boundaries)
   /──────────────────\
  /                    \  Unit Tests
 /                      \ (individual functions)
/────────────────────────\
Each level catches different bugs. All are necessary.
Unit Tests
Testing individual functions:
- Feature detector: given image patch, detect correct corners
- Depth filter: given noisy depth, produce filtered depth
- Pose optimizer: given constraints, find optimal pose
Challenges:
- Floating point comparisons need tolerances
- Random initialization needs seeded tests
- Performance tests (not just correctness)
We have 2,000+ unit tests; they run on every commit.
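The tolerance and seeding points above can be sketched in a minimal unit test. The pose representation, function names, and tolerances here are illustrative, not the framework's actual API:

```python
import math
import random

def normalize_angle(theta):
    """Wrap an angle to [-pi, pi) so rotation differences compare correctly."""
    return (theta + math.pi) % (2 * math.pi) - math.pi

def poses_close(a, b, pos_tol=1e-6, rot_tol=1e-6):
    """Compare two (x, y, theta) poses with separate position/rotation tolerances,
    instead of exact floating-point equality."""
    return (abs(a[0] - b[0]) <= pos_tol
            and abs(a[1] - b[1]) <= pos_tol
            and abs(normalize_angle(a[2] - b[2])) <= rot_tol)

def test_pose_roundtrip():
    random.seed(42)  # seeded RNG: the "random" input is identical on every run
    pose = (random.uniform(-1, 1),
            random.uniform(-1, 1),
            random.uniform(-math.pi, math.pi))
    # Stand-in for a lossy transform-and-invert round trip that accumulates
    # small numerical error.
    recovered = (pose[0] + 1e-9, pose[1] - 1e-9, pose[2] + 1e-9)
    assert poses_close(pose, recovered)
```

The angle wrapping matters: poses at +π and −π are the same orientation, and a naive `abs(a - b)` comparison would flag them as maximally different.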
Integration Tests
Testing component boundaries:
- Camera driver delivers images to feature extractor
- Feature extractor feeds SLAM
- SLAM updates pose service
Mock dependencies, test data flow and error handling.
Key integration points we test:
- Sensor to algorithm handoff
- Algorithm to algorithm handoff
- Algorithm to API surface
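A sensor-to-algorithm handoff test might look like the following sketch, using a mocked camera driver; the `FeatureExtractor` class and its frame-drop behavior are invented for illustration:

```python
from unittest.mock import Mock

class FeatureExtractor:
    """Toy consumer of camera frames; the real component sits here."""
    def __init__(self, camera):
        self.camera = camera
        self.dropped = 0

    def step(self):
        frame = self.camera.get_frame()
        if frame is None:       # driver dropped a frame: count it, don't crash
            self.dropped += 1
            return []
        return [(0, 0)]         # placeholder feature list

def test_extractor_handles_dropped_frames():
    camera = Mock()
    # Simulate a good frame, a dropped frame, then another good frame.
    camera.get_frame.side_effect = [b"frame0", None, b"frame1"]
    extractor = FeatureExtractor(camera)
    results = [extractor.step() for _ in range(3)]
    assert extractor.dropped == 1          # error path exercised exactly once
    assert camera.get_frame.call_count == 3  # data flow across the boundary
    assert results[1] == []                # dropped frame yields no features
```

Mocking the driver keeps the test fast and deterministic while still exercising the boundary contract: what flows across it, and what happens when the upstream side misbehaves.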
System Tests
Full pipeline on recorded data:
Regression datasets: Collection of challenging sequences
- Low light sequences
- Fast motion sequences
- Dynamic scene sequences
- Multi-room sequences
Ground truth: Motion capture or surveyed markers provide reference.
Metrics:
- Absolute Trajectory Error (ATE)
- Relative Pose Error (RPE)
- Tracking loss events
- Feature count over time
Automated dashboard tracks metrics across commits. Regressions block merge.
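As a rough sketch of the first metric, ATE can be computed as the RMSE of per-frame position error. This simplified version assumes the two trajectories are already time-synchronized and expressed in the same frame; a real implementation first rigidly aligns them (e.g. with a Umeyama fit):

```python
import math

def absolute_trajectory_error(estimated, ground_truth):
    """RMSE of per-frame translational error between two (x, y, z) trajectories.
    Assumes equal length, matching timestamps, and a shared reference frame."""
    assert len(estimated) == len(ground_truth) and estimated
    squared = [(ex - gx) ** 2 + (ey - gy) ** 2 + (ez - gz) ** 2
               for (ex, ey, ez), (gx, gy, gz) in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))
```

A regression gate then reduces to a threshold check per dataset, e.g. `assert absolute_trajectory_error(est, gt) < baseline * 1.05`.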
Field Validation
Real devices in real environments:
- Beta users with instrumented builds
- Telemetry aggregation
- Failure classification
Field validation catches what lab testing misses:
- Edge cases we didn't imagine
- Environmental factors we don't control
- User behaviors we didn't anticipate
Synthetic Test Generation
Can we generate tests automatically?
Fuzz testing: Random inputs to find crashes
- Random images → feature detector (should never crash)
- Random depths → mesh builder (should handle gracefully)
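A minimal fuzz loop for the "should never crash" property might look like this; the toy `detect_corners` stands in for the real detector, and sizes deliberately include degenerate zero-pixel images:

```python
import random

def detect_corners(image):
    """Toy stand-in for the detector under test: returns coordinates of
    bright pixels. The real detector would be called here instead."""
    return [(r, c) for r, row in enumerate(image)
                   for c, v in enumerate(row) if v > 250]

def fuzz_detector(iterations=1000, seed=0):
    rng = random.Random(seed)  # seeded so a found crash is reproducible
    for _ in range(iterations):
        h, w = rng.randint(0, 8), rng.randint(0, 8)  # includes 0x0 images
        image = [[rng.randint(0, 255) for _ in range(w)] for _ in range(h)]
        detect_corners(image)  # the only property checked: no exception raised
```

Note the fuzzer asserts nothing about the output; crashes and hangs are the bug class it hunts. Output correctness belongs to the unit and property tests.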
Property-based testing: Define invariants, generate test cases
- Tracking should be consistent: forward then backward = original pose
- Depth filtering should not increase noise
- Pose optimization should decrease error
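The forward-then-backward invariant can be checked property-style with generated inputs. This sketch uses a hand-rolled seeded generator in place of a library like Hypothesis, and a simple 2D pose composition invented for illustration:

```python
import math
import random

def apply(pose, delta):
    """Compose a 2D pose (x, y, theta) with a motion delta in the pose frame."""
    x, y, th = pose
    dx, dy, dth = delta
    return (x + dx * math.cos(th) - dy * math.sin(th),
            y + dx * math.sin(th) + dy * math.cos(th),
            th + dth)

def invert(delta):
    """Inverse motion: applying delta then invert(delta) returns to the start."""
    dx, dy, dth = delta
    c, s = math.cos(-dth), math.sin(-dth)
    return (-(dx * c - dy * s), -(dx * s + dy * c), -dth)

def test_forward_backward(trials=200, seed=1):
    rng = random.Random(seed)
    for _ in range(trials):
        pose = (rng.uniform(-5, 5), rng.uniform(-5, 5),
                rng.uniform(-math.pi, math.pi))
        delta = (rng.uniform(-1, 1), rng.uniform(-1, 1),
                 rng.uniform(-math.pi, math.pi))
        roundtrip = apply(apply(pose, delta), invert(delta))
        # Invariant: forward then backward recovers the original pose
        # up to floating-point tolerance.
        assert all(abs(a - b) < 1e-9 for a, b in zip(pose, roundtrip))
```

The appeal of property tests is that one invariant plus a generator replaces dozens of hand-picked cases, and each generated failure is a concrete counterexample to minimize.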
Flaky Tests
Perception tests are often flaky:
- Numerical precision varies across platforms
- Multi-threaded code has race conditions
- Random initialization causes variance
Solutions:
- Deterministic random seeds
- Tolerance-based comparisons
- Retry with logging on failure
- Quarantine persistently flaky tests
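The retry-with-logging idea can be sketched as a decorator; the name `retry_flaky` and the attempt count are illustrative, not our framework's actual API:

```python
import functools
import logging

def retry_flaky(attempts=3):
    """Re-run a test up to `attempts` times, logging every failure so
    flakes stay visible in CI logs instead of silently passing."""
    def wrap(fn):
        @functools.wraps(fn)
        def run(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except AssertionError:
                    logging.warning("flaky: %s failed attempt %d/%d",
                                    fn.__name__, attempt, attempts)
                    if attempt == attempts:
                        raise  # persistent failure is a real failure
        return run
    return wrap
```

The logged failures feed the quarantine decision: a test that needs retries every day is not flaky, it is broken, and the logs make that visible.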
Current flake rate: 0.3% (down from 5% a year ago).
CI/CD Integration
Test execution is automated:
- Commit → Unit tests (2 min)
- PR → Unit + Integration (15 min)
- Merge → Full system tests (2 hours)
- Daily → Extended regression (8 hours)
No human should need to run tests manually for routine development.
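The tiering above amounts to a small trigger-to-suites mapping; this sketch is a hypothetical shape for that dispatch, not our actual CI configuration:

```python
# Hypothetical mapping of CI trigger to test tiers; names are illustrative.
TIERS = {
    "commit": ["unit"],                                      # ~2 min
    "pr":     ["unit", "integration"],                       # ~15 min
    "merge":  ["unit", "integration", "system"],             # ~2 hours
    "daily":  ["unit", "integration", "system", "extended_regression"],  # ~8 hours
}

def suites_for(trigger):
    """Return the test suites a CI trigger should run; unknown triggers
    fall back to the cheapest tier rather than running nothing."""
    return TIERS.get(trigger, ["unit"])
```

Keeping the mapping in one place makes the cost/coverage trade-off explicit and easy to audit when the slow tiers start creeping into faster triggers.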