Architecting Compute Silicon for Perception Workloads
Influencing next-generation compute chip architecture for perception - what hardware do SLAM, depth processing, and ML inference actually need?
We're starting to influence the compute architecture for future devices. What does perception actually need from silicon?
Current Limitations
On the current chip (mobile SoC + custom accelerators):
- CPU handles control flow and fusion
- DSP runs feature extraction and filtering
- GPU runs ML inference (slowly)
- Custom blocks handle specific functions
Pain points:
- Data movement between units burns power
- Memory bandwidth is the bottleneck
- GPU inference is power-hungry for small models
- Fixed-function blocks lack flexibility
Perception Workload Analysis
Breaking down where cycles go:
| Workload | % Compute | Characteristics |
|---|---|---|
| Feature extraction | 25% | Parallel, local ops |
| Depth processing | 20% | Filtering, interpolation |
| ML inference | 30% | Matrix ops, nonlinear |
| Tracking/SLAM | 15% | Sparse linear algebra |
| Control/fusion | 10% | Sequential, branchy |
Each workload has a different optimal compute architecture.
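The mix in the table can be turned into a rough per-frame cycle budget. A minimal sketch, assuming a hypothetical 1 GHz compute budget and a 30 fps pipeline (illustrative numbers, not our actual clock or frame rate):

```python
# Rough per-frame cycle budgeting from the measured workload mix.
# The 1 GHz clock and 30 fps figures are hypothetical, for illustration only.
WORKLOAD_MIX = {
    "feature_extraction": 0.25,
    "depth_processing": 0.20,
    "ml_inference": 0.30,
    "tracking_slam": 0.15,
    "control_fusion": 0.10,
}

def cycle_budget_per_frame(clock_hz: float, fps: float) -> dict:
    """Split the per-frame cycle budget by workload share."""
    total_cycles = clock_hz / fps
    return {name: share * total_cycles for name, share in WORKLOAD_MIX.items()}

budget = cycle_budget_per_frame(1e9, 30)
# ML inference gets the largest slice: 0.30 * (1e9 / 30) = 1e7 cycles/frame
```

Budgets like this are what make architecture conversations concrete: they say how many cycles each block must deliver per frame, not just relative percentages.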
What We Need
Efficient ML Accelerator
- INT8 matrix multiply (~95% of inference compute)
- Flexible enough for various network shapes
- Power-efficient (10-50 TOPS/W target)
- Low-latency startup (batch size 1, so no batch amortization)
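The sizing math here is simple: MAC count times frame rate gives the sustained throughput the NPU must deliver. A sketch, using a hypothetical 500M-MAC model (the MAC count is an assumed figure, not one of our models):

```python
def required_tops(macs_per_inference: float, inferences_per_sec: float) -> float:
    """Convert a model's MAC count to sustained TOPS (1 MAC = 2 ops)."""
    return 2.0 * macs_per_inference * inferences_per_sec / 1e12

# Hypothetical example: a 500M-MAC detection model running at 30 Hz.
needed = required_tops(500e6, 30)  # 0.03 TOPS sustained
```

The takeaway: our sustained requirement is tiny compared to headline peak TOPS, which is why efficiency and startup latency at small model sizes matter more than peak throughput.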
Image Processing Unit
- 2D convolution engine
- Distortion correction (LUT-based)
- Feature detection (Harris, ORB)
- Stream processing (minimize memory round-trips)
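LUT-based distortion correction is just a per-pixel remap through a precomputed table. An illustrative sketch with nearest-neighbor sampling (a real IPU block would interpolate; the function name and LUT layout are my own, not the hardware interface):

```python
def undistort_lut(img, lut):
    """Remap an image through a precomputed per-pixel LUT.

    lut[v][u] = (x, y) integer source coordinates for output pixel (u, v).
    Nearest-neighbor sampling for brevity; out-of-bounds pixels read as 0.
    """
    height, width = len(lut), len(lut[0])
    out = [[0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            x, y = lut[v][u]
            if 0 <= y < len(img) and 0 <= x < len(img[0]):
                out[v][u] = img[y][x]
    return out
```

Because the LUT is fixed per lens, this is exactly the kind of stable algorithm that justifies a fixed-function block: no control flow, pure streaming reads.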
Memory Architecture
- High bandwidth for tensor operations
- Low latency for sparse access (SLAM)
- Scratchpad for intermediate results
- DMA engines for background data movement
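The on-chip vs off-chip argument comes down to energy per byte moved. A back-of-envelope sketch; the picojoule-per-byte figures are order-of-magnitude assumptions from public architecture literature, not measurements of our silicon:

```python
# Hypothetical energy-per-byte costs (order of magnitude only; real numbers
# depend on process node, DRAM type, and access pattern).
PJ_PER_BYTE = {"off_chip_dram": 100.0, "on_chip_sram": 5.0}

def transfer_energy_mj(bytes_moved: float, memory: str) -> float:
    """Energy cost of moving a tensor, in millijoules (pJ -> mJ is 1e-9)."""
    return bytes_moved * PJ_PER_BYTE[memory] * 1e-9

# Moving 10 MB of activations per frame:
dram_mj = transfer_energy_mj(10e6, "off_chip_dram")  # ~1.0 mJ
sram_mj = transfer_energy_mj(10e6, "on_chip_sram")   # ~0.05 mJ
```

A ~20x energy gap per byte is why "data movement between units burns power" tops the pain-point list, and why a large scratchpad pays for itself on inference.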
Compute Fabric
- Ability to pipeline operations across units
- Minimal CPU involvement in data flow
- Power gating for unused units
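Why pipelining across units matters can be shown with a toy throughput model. The stage times below are made up for illustration:

```python
def steady_state_ms(stage_ms, pipelined):
    """Steady-state per-frame cost: a sequential fabric sums stage times;
    a pipelined fabric's throughput is limited by the slowest stage.
    (Single-frame latency is still the sum either way.)"""
    return max(stage_ms) if pipelined else sum(stage_ms)

stages = [3.0, 2.0, 4.0]  # hypothetical ISP -> NPU -> fusion stage times (ms)
sequential = steady_state_ms(stages, pipelined=False)  # 9.0 ms per frame
pipelined = steady_state_ms(stages, pipelined=True)    # 4.0 ms per frame
```

The catch is that pipelining only works if the units can hand off data without a CPU round-trip per stage, which is exactly the fabric requirement above.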
Trade-offs in Discussion
Fixed function vs programmable: Fixed is efficient but inflexible. Programmable handles algorithm changes but wastes area/power.
Recommendation: Fixed for stable algorithms (distortion, simple filtering), programmable for evolving algorithms (ML, SLAM).
On-chip vs off-chip memory: On-chip is fast and efficient but limited. Off-chip is large but power-hungry.
Recommendation: Large on-chip scratchpad for inference (models fit), accept off-chip for mapping (data doesn't fit).
Separate NPU vs GPU compute: Dedicated NPU is efficient. GPU compute is flexible.
Recommendation: Dedicated NPU for inference, reserve GPU for graphics. Don't share.
Working with the Chip Team
The chip design cycle is 2-3 years. Decisions made now determine what's possible in 2021.
My role:
- Provide workload characterization with cycle-accurate models
- Benchmark competing architectures on our algorithms
- Define KPIs that matter (not just TOPS, but TOPS/W at our model sizes)
- Review architecture proposals for perception fit
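The "TOPS/W at our model sizes" point can be made concrete. A sketch with assumed numbers (the 10 TOPS peak, 20% utilization, and 1 W are hypothetical, not any vendor's figures):

```python
def achieved_tops_per_watt(peak_tops: float, utilization: float, watts: float) -> float:
    """Headline TOPS/W assumes full utilization; small batch-1 models
    often reach only a fraction of peak, so the KPI that matters is
    achieved TOPS/W at our actual model sizes."""
    return peak_tops * utilization / watts

# Hypothetical: a 10-TOPS NPU at 20% utilization on small models, drawing 1 W.
kpi = achieved_tops_per_watt(10.0, 0.20, 1.0)  # 2.0 TOPS/W, not the headline 10
```

This is the kind of derived KPI we can hold architecture proposals to: measured utilization on our networks, not datasheet peaks.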
It's a different kind of engineering - influencing hardware through analysis rather than writing code. But the leverage is enormous.