
ML Inference Optimization: Squeezing Every FLOP

Advanced techniques for neural network optimization on embedded systems - beyond basic quantization and pruning.

Evyatar Bluzer
3 min read

We've done the obvious optimizations (INT8, pruning, efficient architectures). V2's power budget demands going deeper.

The Remaining Gap

Current hand tracking model:

  • Size: 800KB (INT8)
  • Compute: 100M ops per frame
  • Latency: 12ms on target NPU
  • Power: 180mW

V2 targets:

  • Better accuracy (2x model capacity)
  • Lower latency (8ms)
  • Lower power (120mW)

We need more efficiency without sacrificing accuracy.

Advanced Techniques

Structured Sparsity

Unstructured (random) sparsity doesn't help here: hardware can't skip arbitrarily scattered zeros, so the zeroed weights still cost full compute.

Structured sparsity removes entire structures:

  • 2:4 sparsity: 2 zeros in every 4 elements (hardware accelerated on some NPUs)
  • Channel pruning: Remove entire feature channels
  • Block sparsity: Zero out NxN blocks

We're targeting 2:4 sparsity which our V2 NPU supports natively. 2x theoretical speedup.
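A minimal sketch of how 2:4 pruning works on a weight tensor (assuming the element count is a multiple of 4): in every group of four consecutive weights, keep the two largest magnitudes and zero the rest. The function name `prune_2_4` is illustrative, not from any particular library.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Enforce 2:4 sparsity: in every group of 4 consecutive weights,
    keep the 2 largest-magnitude values and zero the other 2."""
    flat = weights.reshape(-1, 4)                     # groups of 4
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]    # 2 smallest |w| per group
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

w = np.array([0.9, -0.1, 0.05, 0.7, 0.3, -0.8, 0.02, 0.6])
pw = prune_2_4(w)
# exactly 2 of every 4 weights are zero, which the NPU can exploit
assert (pw.reshape(-1, 4) == 0).sum(axis=1).tolist() == [2, 2]
```

In practice this is followed by fine-tuning so the surviving weights can compensate for the pruned ones.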

Knowledge Distillation (Advanced)

Beyond basic distillation:

  • Attention transfer: Match intermediate attention maps, not just outputs
  • Feature mimicry: Student learns to reproduce the teacher's internal representations
  • Progressive distillation: Teacher → medium model → small model

3-5% accuracy improvement over direct training.
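A minimal NumPy sketch of a combined loss along these lines: a soft-label KD term (softened-softmax cross-entropy against the teacher) plus an attention-transfer term matching normalized intermediate activation maps. Function names and the `beta` weighting are illustrative assumptions, not our exact training code.

```python
import numpy as np

def softmax(x, t=1.0):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / t)
    return z / z.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits,
                 student_feat, teacher_feat, temp=4.0, beta=0.5):
    """Soft-label KD term plus an attention-transfer term."""
    # KD: cross-entropy of student against softened teacher targets
    p_t = softmax(teacher_logits, temp)
    log_p_s = np.log(softmax(student_logits, temp) + 1e-9)
    kd = -(p_t * log_p_s).sum(axis=-1).mean()

    # attention transfer: L2 between normalized spatial attention maps
    def att(f):  # f: (N, C, H, W) -> (N, H*W), channel-summed, L2-normalized
        a = (f ** 2).sum(axis=1).reshape(f.shape[0], -1)
        return a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-9)

    at = ((att(student_feat) - att(teacher_feat)) ** 2).sum(axis=1).mean()
    return kd + beta * at
```

When the student's feature maps match the teacher's, the attention term vanishes and only the soft-label term remains.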

Neural Architecture Search (NAS)

Don't design architectures - search for them.

Search space:

  • Layer types (conv, depthwise, attention)
  • Layer sizes (channels, kernel size)
  • Connections (skip connections, bottlenecks)

Objective: maximize accuracy subject to latency/power constraint.

We ran NAS for 2 weeks on cloud TPUs. Found architecture 15% more efficient than manual design.
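The objective above can be sketched as constrained random search over a toy space. The latency and accuracy functions here are stand-in proxies, not our real predictor or evaluation pipeline; all names are illustrative.

```python
import random

# toy search space: layer widths, kernel sizes, depth
SPACE = {"channels": [16, 32, 64], "kernel": [3, 5], "blocks": [4, 6, 8]}

def sample():
    return {k: random.choice(v) for k, v in SPACE.items()}

def latency_ms(cfg):   # proxy cost model: grows with compute
    return 0.01 * cfg["blocks"] * cfg["channels"] * cfg["kernel"]

def accuracy(cfg):     # proxy: capacity helps, with diminishing returns
    return 1.0 - 1.0 / (cfg["blocks"] * cfg["channels"])

def search(budget_ms=8.0, trials=500, seed=0):
    """Maximize (proxy) accuracy subject to a hard latency budget."""
    random.seed(seed)
    best, best_acc = None, -1.0
    for _ in range(trials):
        cfg = sample()
        if latency_ms(cfg) > budget_ms:   # reject over-budget candidates
            continue
        if accuracy(cfg) > best_acc:
            best, best_acc = cfg, accuracy(cfg)
    return best
```

Real NAS replaces random sampling with evolutionary or gradient-based controllers, and the proxies with measured latency and trained-model accuracy; the constraint structure is the same.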

Operator Fusion

Individual operations have overhead (kernel launch, memory access). Fusing them helps:

Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU
                    ↓
        FusedConvBNReLU → FusedConvBNReLU

Fused operators:

  • Single kernel launch
  • Intermediate results stay in registers
  • 20-30% speedup for common patterns
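The Conv→BatchNorm part of this fusion is pure algebra: BatchNorm's affine transform can be folded into the conv's weights and bias ahead of time, so the fused kernel runs one op instead of two. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm into the preceding conv:
    y = gamma * (conv(x) + b - mean) / sqrt(var + eps) + beta
      = conv'(x) + b'   with per-output-channel scaled weights."""
    scale = gamma / np.sqrt(var + eps)        # one factor per out-channel
    W_f = W * scale[:, None, None, None]      # W: (Cout, Cin, kH, kW)
    b_f = (b - mean) * scale + beta
    return W_f, b_f
```

The ReLU is then applied inside the same kernel, so intermediate results never leave registers.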

Precision Optimization

Not all layers need same precision:

  • First layer: FP16 (sensitive to input quantization)
  • Middle layers: INT8 (bulk compute, tolerant)
  • Last layer: INT8 or FP16 depending on output sensitivity

Mixed-precision analysis identified 3 layers needing FP16. Others safely INT8.
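One way such an analysis can work is to fake-quantize each layer's weights to INT8 and rank layers by the relative error introduced; the worst offenders become FP16 candidates. A minimal sketch with symmetric per-tensor quantization (names and the error metric are assumptions, not our exact tooling):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 fake-quantization."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.round(x / scale).clip(-127, 127) * scale

def layer_sensitivity(weights_by_layer):
    """Rank layers by relative error from INT8 quantization,
    worst first; top entries are FP16 candidates."""
    report = {}
    for name, w in weights_by_layer.items():
        err = np.linalg.norm(w - quantize_int8(w)) / (np.linalg.norm(w) + 1e-12)
        report[name] = err
    return sorted(report.items(), key=lambda kv: -kv[1])
```

A fuller analysis measures end-to-end accuracy with one layer quantized at a time, but the weight-error ranking is a cheap first pass.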

Memory Layout Optimization

Data layout affects cache efficiency:

  • NCHW vs NHWC depends on hardware
  • Tiled layouts for better locality
  • Ping-pong buffers to hide memory latency

Working with chip team to identify optimal layouts for V2 NPU.
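As a concrete illustration of the NCHW vs NHWC point: the conversion is a transpose plus a copy to make the new layout contiguous, so that one spatial position's channels sit next to each other in memory (often friendlier to NPU DMA engines, though the right choice is hardware-specific).

```python
import numpy as np

def nchw_to_nhwc(x):
    # channels-last: C becomes the innermost dimension, so all channels
    # of one pixel are contiguous; ascontiguousarray forces the copy
    return np.ascontiguousarray(x.transpose(0, 2, 3, 1))

x = np.zeros((1, 8, 32, 32), dtype=np.int8)   # N, C, H, W
y = nchw_to_nhwc(x)
assert y.shape == (1, 32, 32, 8) and y.flags["C_CONTIGUOUS"]
```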

Cumulative Impact

Applying all techniques:

Technique              | Speedup | Accuracy Impact
Structured sparsity    | 1.5x    | -0.5%
Advanced distillation  | n/a     | +3%
NAS architecture       | 1.15x   | +1%
Operator fusion        | 1.25x   | 0%
Mixed precision        | 1.1x    | -0.2%

Combined: ~2x efficiency with net accuracy gain.
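For the arithmetic: multiplying the per-technique speedups as if they were independent gives roughly 2.4x, but the gains overlap (e.g. fusion shrinks the same memory traffic that sparsity reduces), which is why ~2x is the realistic combined figure.

```python
speedups = {"structured sparsity": 1.5, "NAS architecture": 1.15,
            "operator fusion": 1.25, "mixed precision": 1.1}
naive = 1.0
for s in speedups.values():
    naive *= s   # independence assumption: an upper bound, not a forecast
print(f"naive multiplicative speedup: {naive:.2f}x")
```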

Hitting V2 targets looks achievable.
