Learned Features for Visual Localization
Moving from hand-crafted to learned feature descriptors for VPS - training, deployment, and performance gains.
Classical features (SIFT, ORB) have served us well, but learned features now outperform them by a wide margin. It's time to make the switch.
The Case for Learned Features
Classical features struggle with:
- Large viewpoint changes (>30° rotation)
- Illumination changes (day vs night)
- Seasonal changes (snow, foliage)
- Weather (rain, fog)
Learned features, trained on diverse data, handle these better.
Benchmarks (HPatches, Aachen Day-Night):
- SIFT: 45% localization success
- SuperPoint + SuperGlue: 78% localization success
The gap is real and significant.
Feature Learning Approaches
Detection + Description (SuperPoint style)
Train network to jointly detect keypoints and compute descriptors.
Image → CNN → Keypoint Heatmap + Descriptor Map
- Advantages: End-to-end trained, fast inference
- Challenges: Fixed grid output, quantization effects
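The fixed-grid output and its quantization effects can be made concrete: a SuperPoint-style head predicts, for each 8x8 cell of the image, a softmax over the 64 in-cell positions plus a "no keypoint" dustbin channel. A minimal numpy sketch of the decode step (the toy logits at the bottom are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_keypoints(logits, conf_thresh=0.5):
    """Decode a SuperPoint-style (Hc, Wc, 65) logit grid into pixel keypoints.
    Channel 64 is the 'no keypoint' dustbin; channels 0..63 index the 8x8
    positions inside each cell, which is where the grid quantization comes from."""
    Hc, Wc, _ = logits.shape
    probs = softmax(logits, axis=-1)[..., :64]           # drop the dustbin
    heat = probs.reshape(Hc, Wc, 8, 8)                   # per-cell 8x8 layout
    heat = heat.transpose(0, 2, 1, 3).reshape(Hc * 8, Wc * 8)
    ys, xs = np.where(heat > conf_thresh)
    return np.stack([xs, ys], axis=1), heat

# toy example: one confident keypoint in cell (1, 2) at in-cell offset (3, 5)
logits = np.zeros((4, 4, 65))
logits[..., 64] = 10.0            # dustbin wins everywhere by default
logits[1, 2, 3 * 8 + 5] = 20.0    # one strong keypoint logit
kps, _ = decode_keypoints(logits)  # -> [[21 11]], i.e. x = 2*8+5, y = 1*8+3
```

Note that every keypoint lands on an integer pixel of the 8x-upsampled grid; sub-pixel refinement has to happen afterwards, which is exactly the quantization effect listed above.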
Dense Description (D2-Net style)
Describe every pixel, detect from description scores.
Image → CNN → Dense Descriptors → Keypoint Detection
- Advantages: No separate detection, more flexible
- Challenges: Slower, more memory
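"Detect from description scores" can be sketched in a few lines of numpy: score each pixel by how strongly it dominates its best descriptor channel, then keep spatial local maxima. This is a simplified stand-in for D2-Net's soft detection term, not its exact formulation:

```python
import numpy as np

def detect_from_descriptors(desc, top_k=100):
    """D2-Net-style detection sketch: the descriptors double as the detector.
    desc: (C, H, W) dense descriptor map from a CNN backbone."""
    C, H, W = desc.shape
    # channel-wise ratio-to-max selection: favor pixels that dominate a channel
    alpha = desc / (desc.max(axis=(1, 2), keepdims=True) + 1e-8)
    score = (alpha * desc).max(axis=0)           # (H, W) detection score
    # keep only 3x3 spatial local maxima (hard non-max suppression)
    pad = np.pad(score, 1, mode="constant", constant_values=-np.inf)
    neigh = np.stack([pad[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)])
    is_max = score >= neigh.max(axis=0)
    ys, xs = np.where(is_max)
    order = np.argsort(-score[ys, xs])[:top_k]
    return np.stack([xs[order], ys[order]], axis=1)

# toy map: one channel spikes at (x=7, y=5), so that pixel should rank first
desc = np.ones((4, 16, 16)) * 0.1
desc[0, 5, 7] = 10.0
kps = detect_from_descriptors(desc)  # kps[0] -> [7 5]
```

The memory cost is also visible here: the full (C, H, W) descriptor map must be materialized before any keypoint is selected, unlike the heatmap-first approach.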
Hierarchical (HLoc style)
Combine global retrieval network with local features.
Image → Global Net → Candidates → Local Features → Pose
- Advantages: Best accuracy, handles large databases
- Challenges: Multiple networks, complex pipeline
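The retrieval-then-match flow above can be sketched as follows. `localize` and its arguments are hypothetical names, and the final pose step (PnP + RANSAC over the matches) is omitted:

```python
import numpy as np

def localize(query_global, query_local, db_global, db_local, k=3):
    """Hierarchical localization sketch.
    1) Global retrieval: cosine similarity of whole-image descriptors picks
       the k most similar database images.
    2) Local matching: only those candidates get matched with local features,
       so matching cost stays flat as the map grows."""
    # step 1: shortlist by global descriptor similarity
    sims = db_global @ query_global / (
        np.linalg.norm(db_global, axis=1) * np.linalg.norm(query_global) + 1e-8)
    candidates = np.argsort(-sims)[:k]
    # step 2: mutual nearest-neighbor matching against each candidate
    matches = {}
    for idx in candidates:
        d = query_local @ db_local[idx].T       # descriptor similarity matrix
        fwd = d.argmax(axis=1)                  # query -> db
        bwd = d.argmax(axis=0)                  # db -> query
        matches[int(idx)] = [(q, int(fwd[q]))
                             for q in range(len(fwd)) if bwd[fwd[q]] == q]
    return candidates, matches

# toy database: image 2 is the obvious retrieval hit, local features identical
db_global = np.eye(4)
query_global = db_global[2]
query_local = np.eye(5, 8)
db_local = [np.eye(5, 8) for _ in range(4)]
candidates, matches = localize(query_global, query_local, db_global, db_local)
```

Mutual nearest-neighbor filtering is the simplest matcher; a learned matcher like SuperGlue would slot in at the same point in the pipeline.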
Training for VPS
Off-the-shelf models are trained on academic datasets. We need:
- More geographic diversity
- More condition diversity
- Quest-specific camera characteristics
Training data sources:
- Synthetic renders (full control, unlimited)
- Real captures from VPS mapping (authentic but limited)
- Public datasets (diversity but no control)
Training approach:
- Pre-train on large public data
- Fine-tune on VPS-specific data
- Domain adaptation for synthetic-to-real transfer
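The fine-tuning stage needs a correspondence-supervised descriptor loss. A margin loss with hardest-negative mining is a common choice for this; the numpy sketch below is a generic example, not our exact training objective:

```python
import numpy as np

def triplet_descriptor_loss(anchor, positive, negatives, margin=0.2):
    """Margin-based descriptor loss for fine-tuning on correspondence data.
    anchor/positive: (D,) descriptors of the same 3D point in two views.
    negatives: (N, D) descriptors of other points; the hardest one is used."""
    def dist(a, b):
        return np.linalg.norm(a - b, axis=-1)
    pos = dist(anchor, positive)
    hard_neg = dist(anchor, negatives).min()   # hardest-negative mining
    return max(0.0, margin + pos - hard_neg)

# matched descriptors with a distant negative -> loss is zero
a = np.array([1.0, 0.0])
easy = triplet_descriptor_loss(a, a, np.array([[0.0, 1.0]]))        # -> 0.0
# mismatched pair with a confusable negative -> loss is positive
hard = triplet_descriptor_loss(a, np.array([0.0, 1.0]),
                               np.array([[1.0, 0.1]]))
```

Correspondences for the positive pairs come from the VPS mapping pipeline for real captures, or for free from the renderer on synthetic data.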
Deployment Considerations
Learned features are more expensive to run than classical ones:
- Model size: 10MB+ (vs. minimal for ORB)
- Inference: 30-50ms on mobile (vs. 5ms for ORB)
- Memory: Feature maps consume GPU memory
Optimizations:
- Knowledge distillation to smaller student network
- INT8 quantization with minimal accuracy loss
- TensorRT/NNAPI deployment for hardware acceleration
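The distillation step can be framed as dense regression onto the teacher's outputs, so no ground-truth correspondences are needed at that stage. A sketch with an illustrative loss weighting:

```python
import numpy as np

def distillation_loss(student_desc, teacher_desc,
                      student_heat, teacher_heat, w=0.5):
    """Distillation objective sketch: the small student regresses the big
    teacher's dense descriptors and keypoint heatmap on unlabeled images.
    Descriptor maps are (C, H, W); heatmaps are (H, W); w is illustrative."""
    # cosine distance on L2-normalized descriptors, averaged over pixels
    s = student_desc / (np.linalg.norm(student_desc, axis=0, keepdims=True) + 1e-8)
    t = teacher_desc / (np.linalg.norm(teacher_desc, axis=0, keepdims=True) + 1e-8)
    desc_loss = 1.0 - (s * t).sum(axis=0).mean()
    # plain MSE on the detection heatmap
    heat_loss = np.mean((student_heat - teacher_heat) ** 2)
    return desc_loss + w * heat_loss

# sanity check: a student that exactly matches the teacher has ~zero loss
rng = np.random.default_rng(0)
d = rng.normal(size=(8, 4, 4))
h = rng.random(size=(4, 4))
zero = distillation_loss(d, d, h, h)
```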
Current student model: 2MB, 25ms on Quest.
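The INT8 step, in its simplest symmetric per-tensor form (a sketch of the general scheme, not our actual TensorRT calibration pipeline):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: map [-max|x|, max|x|] onto
    [-127, 127] with a single scale factor."""
    scale = max(np.abs(x).max() / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
desc = rng.standard_normal(256).astype(np.float32)
q, scale = quantize_int8(desc)
# round-off error is bounded by half a quantization step (scale / 2)
max_err = np.abs(dequantize_int8(q, scale) - desc).max()
```

Descriptors tolerate this well because matching depends on relative distances, which a uniform scale preserves; that is where the "minimal accuracy loss" comes from.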
A/B Testing Plan
Rolling out gradually:
- Internal dogfooding with learned features
- Small percentage external users
- Measure localization success rate
- Expand if metrics improve
Fallback: classical features remain available whenever the learned path fails.
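The fallback can be sketched as a thin wrapper around both extractors (`learned_fn` and `classical_fn` are hypothetical stand-ins for the real ones):

```python
def extract_features(image, in_treatment, learned_fn, classical_fn, min_kps=50):
    """A/B rollout fallback sketch: use the learned extractor for
    treatment-group users, and fall back to classical features if it
    errors out or returns a degenerate (too sparse) result."""
    if in_treatment:
        try:
            kps, desc = learned_fn(image)
            if len(kps) >= min_kps:
                return kps, desc, "learned"
        except Exception:
            pass  # any runtime failure falls through to the classical path
    kps, desc = classical_fn(image)
    return kps, desc, "classical"

# degenerate learned output (too few keypoints) triggers the fallback
learned = lambda img: ([(0, 0)] * 10, b"L")
classical = lambda img: ([(0, 0)] * 60, b"C")
_, _, source = extract_features(None, True, learned, classical)  # -> "classical"
```

Logging the `source` tag per localization attempt is also what lets the A/B analysis attribute success rates to the right extractor.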
Results So Far
Internal testing shows:
- 40% reduction in localization failures
- Better performance in challenging lighting
- Comparable latency after optimization
Planning broader rollout Q2 2021.