Building a Perception Testing Framework

Systematic testing infrastructure for perception systems - from unit tests to integration tests to field validation.

Evyatar Bluzer
3 min read

Perception systems are notoriously hard to test. Inputs are high-dimensional (images, depth maps), outputs are continuous (poses, keypoints), and "correct" often means "good enough." Here's how we built a testing framework that copes with all three.

Testing Pyramid for Perception

            /\
           /  \  Field Validation
          /    \  (real users, real environments)
         /──────\
        /        \ System Tests
       /          \ (full pipeline, recorded data)
      /────────────\
     /              \ Integration Tests
    /                \ (component boundaries)
   /──────────────────\
  /                    \ Unit Tests
 /                      \ (individual functions)
/────────────────────────\

Each level catches different bugs. All are necessary.

Unit Tests

Testing individual functions:

  • Feature detector: given image patch, detect correct corners
  • Depth filter: given noisy depth, produce filtered depth
  • Pose optimizer: given constraints, find optimal pose

Challenges:

  • Floating point comparisons need tolerances
  • Random initialization needs seeded tests
  • Performance tests (not just correctness)

We have 2,000+ unit tests; they run on every commit.
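As a concrete illustration of the two main challenges above (tolerance-based comparisons and seeded randomness), here is a minimal sketch of a depth-filter unit test. The `filter_depth` moving-average filter is a hypothetical stand-in, not our actual implementation:

```python
import math
import random

def filter_depth(depths, window=3):
    # Hypothetical stand-in depth filter: simple moving-average smoothing.
    half = window // 2
    out = []
    for i in range(len(depths)):
        lo, hi = max(0, i - half), min(len(depths), i + half + 1)
        out.append(sum(depths[lo:hi]) / (hi - lo))
    return out

def test_depth_filter_reduces_noise():
    # Seed the RNG so the "noisy" input is identical on every run.
    rng = random.Random(42)
    truth = [1.0] * 100
    noisy = [d + rng.gauss(0, 0.05) for d in truth]
    filtered = filter_depth(noisy)

    rmse = lambda xs: math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, truth)) / len(truth))
    # Property under test: filtering must not increase noise.
    assert rmse(filtered) < rmse(noisy)
    # Floating-point comparison with an explicit tolerance, never exact equality.
    assert math.isclose(sum(filtered) / len(filtered), 1.0, abs_tol=0.05)
```

Without the fixed seed, this test would pass or fail depending on the draw; with it, a failure is always reproducible.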

Integration Tests

Testing component boundaries:

  • Camera driver delivers images to feature extractor
  • Feature extractor feeds SLAM
  • SLAM updates pose service

We mock dependencies and test both data flow and error handling.

Key integration points we test:

  • Sensor to algorithm handoff
  • Algorithm to algorithm handoff
  • Algorithm to API surface
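A sketch of what such a boundary test can look like, using Python's `unittest.mock`. The `run_frame` pipeline glue and the component names are illustrative assumptions, not our real interfaces:

```python
from unittest.mock import Mock

def run_frame(camera, extractor, slam, pose_service):
    # Hypothetical glue code: push one frame through the perception stack.
    image = camera.get_frame()
    if image is None:
        return False  # error path: the driver dropped a frame
    features = extractor.extract(image)
    pose = slam.track(features)
    pose_service.update(pose)
    return True

def test_camera_to_extractor_handoff():
    camera, extractor, slam, pose_service = Mock(), Mock(), Mock(), Mock()
    camera.get_frame.return_value = "image-bytes"
    slam.track.return_value = "pose"

    assert run_frame(camera, extractor, slam, pose_service)
    # Verify data flow across each component boundary.
    extractor.extract.assert_called_once_with("image-bytes")
    pose_service.update.assert_called_once_with("pose")

def test_dropped_frame_is_handled():
    camera, extractor, slam, pose_service = Mock(), Mock(), Mock(), Mock()
    camera.get_frame.return_value = None  # simulate a driver error

    assert not run_frame(camera, extractor, slam, pose_service)
    # Downstream components must not run on the error path.
    extractor.extract.assert_not_called()
```

The mocks let us exercise both the happy path and the failure path without real sensors.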

System Tests

Full pipeline on recorded data:

Regression datasets: Collection of challenging sequences

  • Low light sequences
  • Fast motion sequences
  • Dynamic scene sequences
  • Multi-room sequences

Ground truth: Motion capture or surveyed markers provide reference.

Metrics:

  • Absolute Trajectory Error (ATE)
  • Relative Pose Error (RPE)
  • Tracking loss events
  • Feature count over time

Automated dashboard tracks metrics across commits. Regressions block merge.
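The trajectory metrics are standard in the SLAM literature. A minimal sketch of ATE and translational RPE, assuming the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame (real implementations also perform alignment and handle rotations):

```python
import math

def absolute_trajectory_error(estimated, ground_truth):
    # ATE: RMSE of per-pose translational error, for pre-aligned trajectories.
    assert len(estimated) == len(ground_truth)
    sq = [sum((e - g) ** 2 for e, g in zip(est, gt))
          for est, gt in zip(estimated, ground_truth)]
    return math.sqrt(sum(sq) / len(sq))

def relative_pose_error(estimated, ground_truth, delta=1):
    # Translational RPE: error in the relative motion between poses
    # `delta` steps apart, so global drift does not dominate.
    errs = []
    for i in range(len(estimated) - delta):
        est_step = [b - a for a, b in zip(estimated[i], estimated[i + delta])]
        gt_step = [b - a for a, b in zip(ground_truth[i], ground_truth[i + delta])]
        errs.append(sum((e - g) ** 2 for e, g in zip(est_step, gt_step)))
    return math.sqrt(sum(errs) / len(errs))
```

A uniformly offset trajectory has nonzero ATE but zero RPE, which is exactly why we track both.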

Field Validation

Real devices in real environments:

  • Beta users with instrumented builds
  • Telemetry aggregation
  • Failure classification

Field validation catches what lab testing misses:

  • Edge cases we didn't imagine
  • Environmental factors we don't control
  • User behaviors we didn't anticipate

Synthetic Test Generation

Can we generate tests automatically?

Fuzz testing: Random inputs to find crashes

  • Random images → feature detector (should never crash)
  • Random depths → mesh builder (should handle gracefully)

Property-based testing: Define invariants, generate test cases

  • Tracking should be consistent: forward then backward = original pose
  • Depth filtering should not increase noise
  • Pose optimization should decrease error
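Frameworks like Hypothesis automate the case generation; the idea can also be hand-rolled. Here is a sketch of the forward-then-backward invariant over a toy 2D translational pose model (the real invariant holds for full 6-DoF poses, but the structure of the test is the same):

```python
import random

def apply_motion(pose, motion):
    # Toy 2D pose composition: translate by `motion`.
    return (pose[0] + motion[0], pose[1] + motion[1])

def invert(motion):
    return (-motion[0], -motion[1])

def check_forward_backward_property(trials=500, seed=1):
    # Property: applying a motion and then its inverse recovers the
    # original pose, up to floating-point tolerance, for ANY random input.
    rng = random.Random(seed)
    for _ in range(trials):
        pose = (rng.uniform(-10, 10), rng.uniform(-10, 10))
        motion = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        back = apply_motion(apply_motion(pose, motion), invert(motion))
        assert abs(back[0] - pose[0]) < 1e-9
        assert abs(back[1] - pose[1]) < 1e-9
```

Instead of hand-picking a few examples, the test asserts the invariant over hundreds of generated cases.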

Flaky Tests

Perception tests are often flaky:

  • Numerical precision varies across platforms
  • Multi-threaded code has race conditions
  • Random initialization causes variance

Solutions:

  • Deterministic random seeds
  • Tolerance-based comparisons
  • Retry with logging on failure
  • Quarantine persistently flaky tests

Current flake rate: 0.3% (down from 5% a year ago).
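One way to implement "retry with logging" is a small decorator. This is an illustrative sketch, not our actual harness; the key design point is that every failed attempt is logged, so flakes stay visible instead of silently passing on retry:

```python
import functools
import logging

def retry_flaky(times=3):
    # Rerun a flaky test up to `times` attempts, logging each failure.
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except AssertionError:
                    logging.warning("flaky test %s failed attempt %d/%d",
                                    fn.__name__, attempt, times)
                    if attempt == times:
                        raise  # still a real failure after all retries
        return wrapper
    return decorate
```

The logged attempts feed the flake-rate metric, which is how a quarantine decision gets made for persistently flaky tests.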

CI/CD Integration

Test execution is automated:

  • Commit → Unit tests (2 min)
  • PR → Unit + Integration (15 min)
  • Merge → Full system tests (2 hours)
  • Daily → Extended regression (8 hours)

No human should need to run tests manually for routine development.
