Architecting Compute Silicon for Perception Workloads

Influencing next-generation compute chip architecture for perception - what hardware do SLAM, depth processing, and ML inference actually need?

Evyatar Bluzer
3 min read

We're starting to influence the compute architecture for future devices. What does perception actually need from silicon?

Current Limitations

On the current chip (mobile SoC + custom accelerators):

  • CPU handles control flow and fusion
  • DSP runs feature extraction and filtering
  • GPU runs ML inference (slowly)
  • Custom blocks handle specific functions

Pain points:

  • Data movement between units burns power
  • Memory bandwidth is the bottleneck
  • GPU inference is power-hungry for small models
  • Fixed-function blocks lack flexibility
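The first two pain points compound each other: every extra off-chip hop costs far more energy than the arithmetic itself. A back-of-the-envelope sketch, using rough public ballpark energy figures (the pJ constants are illustrative, not numbers for our chip):

```python
# Illustrative energy comparison: moving a frame off-chip vs. computing on it.
# Constants are rough public ballparks (pJ), NOT figures for our silicon.
DRAM_PJ_PER_BYTE = 100.0   # off-chip DRAM access
MAC_PJ = 0.5               # one INT8 multiply-accumulate

def frame_energy_mj(width, height, bytes_per_px, hops, macs_per_px):
    """Return (move_mJ, compute_mJ) for one frame moved 'hops' times off-chip."""
    n_bytes = width * height * bytes_per_px
    move_mj = n_bytes * hops * DRAM_PJ_PER_BYTE * 1e-9      # pJ -> mJ
    compute_mj = width * height * macs_per_px * MAC_PJ * 1e-9
    return move_mj, compute_mj

# A 720p depth frame bounced CPU -> DSP -> GPU (3 hops), 50 MACs/pixel:
move, compute = frame_energy_mj(1280, 720, 2, hops=3, macs_per_px=50)
```

With these numbers the data movement burns over 20x the energy of the compute, which is why stream processing and on-chip hand-off dominate the wish list below.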

Perception Workload Analysis

Breaking down where cycles go:

| Workload           | % Compute | Characteristics       |
| ------------------ | --------- | --------------------- |
| Feature extraction | 25%       | Parallel, local ops   |
| Depth processing   | 20%       | Filtering, interpolation |
| ML inference       | 30%       | Matrix ops, nonlinear |
| Tracking/SLAM      | 15%       | Sparse linear algebra |
| Control/fusion     | 10%       | Sequential, branchy   |

Each workload has a different optimal compute architecture.
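The shares in the table also bound how much any single accelerator can help, Amdahl-style. A quick sketch with the percentages above:

```python
# Amdahl-style check: accelerating one workload class only helps in
# proportion to its share of total cycles (shares from the table above).
SHARES = {"feature": 0.25, "depth": 0.20, "ml": 0.30, "slam": 0.15, "fusion": 0.10}

def overall_speedup(shares, accel):
    """accel maps workload name -> speedup factor (default 1x, unaccelerated)."""
    new_time = sum(share / accel.get(name, 1.0) for name, share in shares.items())
    return 1.0 / new_time

# A 10x ML accelerator alone yields only ~1.37x end to end:
ml_only = overall_speedup(SHARES, {"ml": 10.0})
```

This is the core argument for accelerating the whole pipeline rather than one hot spot.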

What We Need

Efficient ML Accelerator

  • INT8 matrix multiply (95% of inference)
  • Flexible enough for various network shapes
  • Low power (10-50 TOPS/W target)
  • Low latency startup (no batch amortization)
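The INT8 matmul pattern itself is simple: 8-bit inputs, a wide 32-bit accumulator, then requantization back to 8 bits. A minimal NumPy sketch of that dataflow (the scale values are illustrative; a real deployment derives them from calibration):

```python
import numpy as np

# Sketch of the INT8 matmul pattern an NPU runs: INT8 operands,
# 32-bit accumulation, then requantize back to INT8.
def int8_matmul(a_q, b_q, scale_a, scale_b, scale_out):
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)   # wide accumulator
    real = acc * (scale_a * scale_b)                    # back to real values
    out = np.clip(np.round(real / scale_out), -128, 127)
    return out.astype(np.int8)

a = np.random.randint(-128, 128, (4, 8), dtype=np.int8)
b = np.random.randint(-128, 128, (8, 4), dtype=np.int8)
y = int8_matmul(a, b, scale_a=0.02, scale_b=0.01, scale_out=0.1)
```

The hardware question is everything around this kernel: feeding the accumulators at line rate and handling arbitrary matrix shapes without stalls.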

Image Processing Unit

  • 2D convolution engine
  • Distortion correction (LUT-based)
  • Feature detection (Harris, ORB)
  • Stream processing (minimize memory round-trips)
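LUT-based distortion correction is worth spelling out: the LUT stores, for every output pixel, the source coordinate to sample, so the hardware block reduces to a gather plus interpolation. A sketch with nearest-neighbor sampling for brevity (a real block would use bilinear):

```python
import numpy as np

# LUT-based remap: lut[y, x] holds the (x_src, y_src) coordinate to sample
# for output pixel (x, y). Hardware then just gathers and interpolates.
def build_identity_lut(h, w):
    ys, xs = np.mgrid[0:h, 0:w]
    return np.stack([xs, ys], axis=-1).astype(np.float32)

def remap_nearest(img, lut):
    xs = np.clip(np.round(lut[..., 0]).astype(int), 0, img.shape[1] - 1)
    ys = np.clip(np.round(lut[..., 1]).astype(int), 0, img.shape[0] - 1)
    return img[ys, xs]

img = np.arange(16, dtype=np.uint8).reshape(4, 4)
out = remap_nearest(img, build_identity_lut(4, 4))  # identity LUT: unchanged
```

Because the lens model is baked into the table offline, the same block handles any distortion profile without reprogramming.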

Memory Architecture

  • High bandwidth for tensor operations
  • Low latency for sparse access (SLAM)
  • Scratchpad for intermediate results
  • DMA engines for background data movement
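The DMA point is really about double buffering: while the compute unit works on one buffer, the DMA fills the other in the background. A pure-Python sketch of the ping-pong pattern (`dma_load` and `process` are hypothetical stand-ins for the hardware hooks):

```python
# Ping-pong (double) buffering pattern the DMA engines enable.
# 'dma_load' and 'process' are hypothetical stand-ins; on hardware the
# load of tile i+1 overlaps with compute on tile i.
def stream_tiles(tiles, dma_load, process):
    results = []
    buf = [None, None]
    buf[0] = dma_load(tiles[0])                # prime the first buffer
    for i in range(len(tiles)):
        nxt = (i + 1) % 2
        if i + 1 < len(tiles):
            buf[nxt] = dma_load(tiles[i + 1])  # background fill
        results.append(process(buf[i % 2]))    # compute on current buffer
    return results

out = stream_tiles([1, 2, 3], dma_load=lambda t: t * 10, process=lambda b: b + 1)
```

Done right, the compute units never stall on memory, and the CPU never touches the data path.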

Compute Fabric

  • Ability to pipeline operations across units
  • Minimal CPU involvement in data flow
  • Power gating for unused units

Trade-offs in Discussion

Fixed function vs programmable: Fixed is efficient but inflexible. Programmable handles algorithm changes but wastes area/power.

Recommendation: Fixed for stable algorithms (distortion, simple filtering), programmable for evolving algorithms (ML, SLAM).

On-chip vs off-chip memory: On-chip is fast and efficient but limited. Off-chip is large but power-hungry.

Recommendation: Large on-chip scratchpad for inference (models fit), accept off-chip for mapping (data doesn't fit).
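The sizing argument behind that recommendation is a one-line inequality: an INT8 model's weights plus peak activations must fit in the scratchpad. The sizes below are hypothetical, purely to show the shape of the check:

```python
# Fit check behind the scratchpad recommendation. All sizes hypothetical.
def fits_on_chip(weight_mb, peak_activation_mb, scratchpad_mb):
    """True if weights plus peak activations fit in on-chip scratchpad."""
    return weight_mb + peak_activation_mb <= scratchpad_mb

# A 4 MB INT8 detection model with 2 MB peak activations fits an 8 MB scratchpad:
model_fits = fits_on_chip(4.0, 2.0, 8.0)
# A dense SLAM map measured in hundreds of MB clearly does not:
map_fits = fits_on_chip(300.0, 0.0, 8.0)
```

Inference passes the check and stays on-chip; the map fails it, so mapping eats the off-chip power cost by design.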

Separate NPU vs GPU compute: Dedicated NPU is efficient. GPU compute is flexible.

Recommendation: Dedicated NPU for inference, reserve GPU for graphics. Don't share.

Working with the Chip Team

The chip design cycle is 2-3 years. Decisions made now determine what's possible in 2021.

My role:

  • Provide workload characterization with cycle-accurate models
  • Benchmark competing architectures on our algorithms
  • Define KPIs that matter (not just TOPS, but TOPS/W at our model sizes)
  • Review architecture proposals for perception fit

It's a different kind of engineering - influencing hardware through analysis rather than writing code. But the leverage is enormous.
