Neural Networks on Embedded: The Optimization Journey
Deploying neural networks on power-constrained hardware - quantization, pruning, architecture search, and the gap between research and production.
A hand tracking model that achieves state-of-the-art accuracy on a GPU server means nothing if it can't run on a headset. Bridging this gap is the work.
The Constraints
Our compute budget for hand tracking:
- Latency: under 15ms end-to-end
- Power: under 200mW average
- Memory: under 50MB model size
- Hardware: ARM CPU + DSP + limited GPU
A typical research model: 50M parameters, 10B FLOPs, 100ms inference on mobile GPU.
We need 100x improvement.
Optimization Strategies
Architecture Efficiency
Start with an efficient architecture:
- MobileNet-style depthwise separable convolutions: 8-9x fewer operations than standard conv
- EfficientNet compound scaling: Balanced depth/width/resolution
- Inverted residuals: Expand-depthwise-project pattern
We've moved from ResNet-50 (25M params) to a custom architecture (800K params) with similar accuracy.
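The building blocks above can be sketched in a few lines of PyTorch. This is a minimal illustration of a depthwise separable convolution and an inverted residual block, not our production architecture; channel counts and expansion ratio are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv followed by a 1x1 pointwise conv (MobileNet-style)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch makes each filter see only its own input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class InvertedResidual(nn.Module):
    """Expand -> depthwise -> project, with a skip when shapes match."""
    def __init__(self, ch, expand=4):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False),            # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),             # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, ch, 1, bias=False),             # project (linear)
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)  # residual skip
```

The "linear" projection (no activation after the last 1x1 conv) is the detail that makes inverted residuals work in low-dimensional bottlenecks.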
Quantization
Moving from FP32 to INT8:
- 4x memory reduction
- 2-4x speedup on DSP/NPU
- ~1% accuracy loss if done carefully
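The numbers above follow from the affine mapping that INT8 quantization uses: real = scale × (q − zero_point). A minimal sketch of that mapping (function names are ours, not from any library):

```python
def quantize_params(x_min, x_max, qmin=-128, qmax=127):
    """Derive scale and zero point so [x_min, x_max] maps onto [qmin, qmax]."""
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, int(max(qmin, min(qmax, zero_point)))

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """FP32 value -> INT8 code (round, then clamp to the int8 range)."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """INT8 code -> approximate FP32 value."""
    return scale * (q - zero_point)
```

Round-tripping any value loses at most about half a quantization step (scale / 2), which is where the "~1% accuracy loss" budget is spent.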
Key techniques:
- Post-training quantization: Quick to apply, but costs more accuracy
- Quantization-aware training: Train with quantization in the loop, better accuracy
- Mixed precision: Keep sensitive layers in higher precision
We use quantization-aware training with INT8 weights and activations.
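The core trick in quantization-aware training is a fake-quantize op: round to the INT8 grid in the forward pass, but pass gradients straight through in the backward pass so training still works. A minimal sketch of that idea (a real deployment would use the framework's QAT tooling, e.g. torch.ao.quantization, rather than this hand-rolled version):

```python
import torch

class FakeQuantize(torch.autograd.Function):
    """Simulate INT8 rounding in forward; straight-through estimator in backward."""
    @staticmethod
    def forward(ctx, x, scale):
        # Round onto the INT8 grid, clamp to range, then dequantize, so the
        # next layer trains against the values it will see at inference time.
        q = torch.clamp(torch.round(x / scale), -128, 127)
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through: pretend rounding was the identity function.
        return grad_out, None
```

Because the rounding is visible during training, the network learns weights that survive quantization, instead of being quantized blind after the fact.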
Pruning
Remove unnecessary parameters:
- Structured pruning: Remove entire filters/channels (hardware-friendly)
- Unstructured pruning: Remove individual weights (higher sparsity, but sparse kernels are poorly supported on most hardware)
We achieve 50% structured sparsity with under 2% accuracy drop.
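PyTorch's pruning utilities show what 50% structured sparsity looks like in practice: entire output filters are zeroed by L2 norm. Note this only masks the weights; actually shrinking the tensors (and getting the speedup) needs a follow-up surgery step this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(32, 64, kernel_size=3)

# Zero out the 50% of output filters (dim=0) with the smallest L2 norm (n=2).
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)

# Count filters whose weights are now entirely zero.
zeroed = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()

# Fold the mask into the weight tensor permanently.
prune.remove(conv, "weight")
```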
Knowledge Distillation
Train small model to mimic large model:
Loss = α × TaskLoss + (1-α) × DistillationLoss
The large "teacher" model provides softer targets that transfer more information than hard labels.
This yields a 5-10% accuracy improvement over training the small model directly.
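The combined loss can be written out directly. The temperature T softens both distributions before the KL term; alpha and T here are illustrative hyperparameters, not our tuned values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    """Loss = alpha * TaskLoss + (1 - alpha) * DistillationLoss."""
    # Hard-label task loss.
    task = F.cross_entropy(student_logits, labels)

    # Soften both distributions with temperature T; KL pulls the student
    # toward the teacher's full output distribution, not just the argmax.
    soft_teacher = F.log_softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(soft_student, soft_teacher, log_target=True,
                       reduction="batchmean") * (T * T)

    return alpha * task + (1 - alpha) * distill
```

The T² factor rescales gradients so the distillation term's magnitude stays comparable as T changes.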
The Deployment Stack
PyTorch Model → ONNX Export → Target Compiler →
Optimized Kernels → Runtime Inference
Each step can introduce errors or performance regressions. Need automated testing at each stage.
Lessons Learned
- Design for deployment from the start: Retrofitting efficiency into a research model is painful. Constraints should inform architecture.
- Profile on target hardware: Desktop GPU profiling doesn't predict DSP performance. Always measure on actual silicon.
- Accuracy vs latency curves: Know where you are on this curve. Sometimes giving up 5% accuracy is worth a 2x speedup.
- Beware the long tail: Average performance doesn't capture worst-case. Some inputs may take 3x longer.
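Catching the long tail means reporting percentiles, not averages. A minimal host-side sketch of that measurement, with a toy stand-in model; the real numbers must come from the target CPU/DSP, as noted above.

```python
import time
import torch
import torch.nn as nn

# Toy stand-in; on device this would be the deployed hand-tracking model.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()

latencies_ms = []
with torch.no_grad():
    for _ in range(50):
        x = torch.randn(1, 3, 64, 64)
        start = time.perf_counter()
        model(x)
        latencies_ms.append((time.perf_counter() - start) * 1e3)

latencies_ms.sort()
p50 = latencies_ms[len(latencies_ms) // 2]       # median
p99 = latencies_ms[int(len(latencies_ms) * 0.99)] # worst-case tail
```

If p99 is several times p50, the average is hiding exactly the frames that will cause visible tracking glitches.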
Current status: 12ms inference on target hardware. Within budget, but no margin.