Perception System Architecture: Putting It All Together
Architecting a complete perception system for mixed reality - from sensor selection to software pipeline to system integration.
After months of deep-diving into individual components - depth sensors, SLAM, sensor fusion - it's time to step back and architect the complete perception system.
System Requirements
What we need to deliver:
- 6DoF head tracking: under 1mm position, under 0.1° orientation accuracy
- Environment mapping: Room-scale 3D mesh, ~1cm resolution
- Plane detection: Horizontal and vertical surfaces for content placement
- Hand tracking: 25-joint hand skeleton at 30Hz (stretch goal)
- Eye tracking: Gaze vector for foveated rendering and interaction
All within 1.5W perception budget and under 20ms latency.
Sensor Suite
After extensive prototyping:
| Sensor | Purpose | Resolution | Rate |
|---|---|---|---|
| Tracking cameras (x2) | VIO, SLAM | 640x480 | 60Hz |
| Depth camera | Meshing, plane detection | 320x240 | 30Hz |
| Eye cameras (x2) | Eye tracking | 320x320 | 90Hz |
| IMU | High-rate motion | - | 1kHz |
The two tracking cameras provide stereo overlap and wide field-of-view coverage. The depth camera supplements them with metric depth for meshing. The eye cameras run at a higher rate for responsive gaze tracking.
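As a sanity check on the sensor interface budget, the raw data rates work out roughly as follows. The byte depths are my assumptions (8-bit mono for the tracking and eye cameras, 16-bit depth), not measured specs:

```python
# Back-of-envelope raw bandwidth for the sensor suite above.
# Byte depths are assumptions: 8-bit mono tracking/eye cameras, 16-bit depth.

sensors = {
    # name: (count, width, height, bytes_per_px, rate_hz)
    "tracking_cam": (2, 640, 480, 1, 60),
    "depth_cam":    (1, 320, 240, 2, 30),
    "eye_cam":      (2, 320, 320, 1, 90),
}

total = 0.0
for name, (n, w, h, bpp, hz) in sensors.items():
    mbps = n * w * h * bpp * hz / 1e6  # megabytes per second
    total += mbps
    print(f"{name:12s} {mbps:6.1f} MB/s")
print(f"{'total':12s} {total:6.1f} MB/s")  # ~60 MB/s
```

At roughly 60 MB/s raw, the camera suite is modest by mobile interconnect standards, and the IMU's contribution is negligible by comparison.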
Processing Architecture
```
┌─────────────────────┐
│      DSP Core       │
│  - Feature extract  │
│  - Depth filtering  │
└──────────┬──────────┘
           │
┌──────────┐         ┌────────▼────────┐         ┌──────────┐
│ Sensors  │────────►│   CPU Cores     │────────►│ Display  │
│          │         │  - VIO/SLAM     │         │ Pipeline │
└──────────┘         │  - Fusion       │         └──────────┘
                     │  - Eye tracking │
                     └────────┬────────┘
                              │
                    ┌─────────▼─────────┐
                    │     GPU/NPU       │
                    │  - Meshing        │
                    │  - Hand tracking  │
                    │  - ML inference   │
                    └───────────────────┘
```
Data Flow
Critical path (head tracking):
- Camera frame captured (t=0)
- IMU propagation for immediate pose (t=1ms)
- Feature extraction on DSP (t=5ms)
- VIO update on CPU (t=10ms)
- Pose delivered to display (t=12ms)
Total latency: 12ms, leaving 8ms of headroom within the 20ms budget for the display pipeline.
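The "IMU propagation" step in the critical path is dead-reckoning from the last VIO-corrected state. A minimal gyro-only orientation sketch (the real pipeline also integrates accelerometer data and tracks biases, and the function names here are made up):

```python
import numpy as np

def propagate_orientation(q, gyro, dt):
    """Integrate body-frame angular rate (rad/s) over dt into quaternion q (w,x,y,z)."""
    wx, wy, wz = gyro
    # Quaternion kinematics: q_dot = 0.5 * Omega(w) @ q
    omega = np.array([
        [0.0, -wx, -wy, -wz],
        [ wx, 0.0,  wz, -wy],
        [ wy, -wz, 0.0,  wx],
        [ wz,  wy, -wx, 0.0],
    ])
    q = q + 0.5 * omega @ q * dt     # forward-Euler step
    return q / np.linalg.norm(q)     # renormalize to stay on the unit sphere

# 1kHz IMU: rotating at 90°/s about z for one second should yield ~90° of yaw.
q = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(1000):
    q = propagate_orientation(q, (0.0, 0.0, np.pi / 2), 1e-3)
yaw_deg = np.degrees(2 * np.arctan2(q[3], q[0]))
print(yaw_deg)  # close to 90
```

At 1kHz the per-step Euler error is tiny, which is why propagation between 60Hz VIO corrections stays well under the tracking accuracy budget.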
Interface Contracts
Clear APIs between subsystems:
- Pose service: Provides head pose at any timestamp (interpolated/extrapolated)
- Map service: Spatial anchors, meshes, plane primitives
- Gaze service: Eye gaze rays for rendering and input
Each service has defined latency, accuracy, and failure mode contracts.
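The pose service contract ("head pose at any timestamp") implies keeping a short history of timestamped poses and interpolating between them: lerp for position, slerp for orientation, with the same math extrapolating slightly past the newest sample. A sketch under those assumptions; class and method names are hypothetical, not the real API:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical interpolation between unit quaternions (w,x,y,z); t>1 extrapolates."""
    dot = np.dot(q0, q1)
    if dot < 0.0:                  # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:               # nearly parallel: linear blend is fine
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

class PoseService:
    """Keeps timestamped (position, quaternion) samples; answers pose queries."""
    def __init__(self):
        self.samples = []          # time-ordered (t, pos, quat); needs >= 2 entries

    def push(self, t, pos, quat):
        self.samples.append((t, np.asarray(pos, float), np.asarray(quat, float)))

    def pose_at(self, t):
        (t0, p0, q0), (t1, p1, q1) = self.samples[-2], self.samples[-1]
        a = (t - t0) / (t1 - t0)   # a > 1 extrapolates beyond the newest sample
        return p0 + a * (p1 - p0), slerp(q0, q1, a)

svc = PoseService()
svc.push(0.000, (0.00, 0, 0), (1.0, 0, 0, 0))
svc.push(0.010, (0.02, 0, 0), (np.cos(0.1), 0, 0, np.sin(0.1)))
pos, quat = svc.pose_at(0.005)    # midpoint query
print(pos, quat)
```

The same `pose_at` path serves both historical queries (sensor fusion aligning measurements) and slightly-future queries (late-stage reprojection in the display pipeline).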
Open Questions
Still wrestling with:
- Depth vs stereo for meshing - active depth gives cleaner geometry but draws more power
- Eye tracking accuracy requirements - the target depends on the display architecture
- Persistent maps - how much on-device storage, and what are the privacy implications
December will be spec freeze. Time to commit.