Learned Features for Visual Localization
Moving from hand-crafted to learned feature descriptors for VPS - training, deployment, and performance gains.
Classical features (SIFT, ORB) have served us well, but learned features now outperform them by a wide margin. It's time to make the switch.
The Case for Learned Features
Classical features struggle with:
- Large viewpoint changes (>30° rotation)
- Illumination changes (day vs night)
- Seasonal changes (snow, foliage)
- Weather (rain, fog)
Learned features, trained on diverse data, handle these better.
Benchmarks (HPatches, Aachen Day-Night):
- SIFT: 45% localization success
- SuperPoint + SuperGlue: 78% localization success
The gap is real and significant.
Feature Learning Approaches
Detection + Description (SuperPoint style)
Train network to jointly detect keypoints and compute descriptors.
Image → CNN → Keypoint Heatmap + Descriptor Map
- Advantages: End-to-end trained, fast inference
- Challenges: Fixed grid output, quantization effects
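The fixed-grid output and its quantization effects can be made concrete: a SuperPoint-style head predicts, for each 8x8 cell of the image, a softmax over the 64 in-cell positions plus a "no keypoint" dustbin channel. A minimal numpy sketch of the decode step (the toy logits at the bottom are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_keypoints(logits, conf_thresh=0.5):
    """Decode a SuperPoint-style (Hc, Wc, 65) logit grid into pixel keypoints.
    Channel 64 is the 'no keypoint' dustbin; channels 0..63 index the 8x8
    positions inside each cell, which is where the grid quantization comes from."""
    Hc, Wc, _ = logits.shape
    probs = softmax(logits, axis=-1)[..., :64]           # drop the dustbin
    heat = probs.reshape(Hc, Wc, 8, 8)                   # per-cell 8x8 layout
    heat = heat.transpose(0, 2, 1, 3).reshape(Hc * 8, Wc * 8)
    ys, xs = np.where(heat > conf_thresh)
    return np.stack([xs, ys], axis=1), heat

# toy example: one confident keypoint in cell (1, 2) at in-cell offset (3, 5)
logits = np.zeros((4, 4, 65))
logits[..., 64] = 10.0            # dustbin wins everywhere by default
logits[1, 2, 3 * 8 + 5] = 20.0    # one strong keypoint logit
kps, _ = decode_keypoints(logits)  # -> [[21 11]], i.e. x = 2*8+5, y = 1*8+3
```

Note that every keypoint lands on an integer pixel of the 8x-upsampled grid; sub-pixel refinement has to happen afterwards, which is exactly the quantization effect listed above.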
Dense Description (D2-Net style)
Describe every pixel, detect from description scores.
Image → CNN → Dense Descriptors → Keypoint Detection
- Advantages: No separate detection, more flexible
- Challenges: Slower, more memory
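"Detect from description scores" can be sketched in a few lines of numpy: score each pixel by how strongly it dominates its best descriptor channel, then keep spatial local maxima. This is a simplified stand-in for D2-Net's soft detection term, not its exact formulation:

```python
import numpy as np

def detect_from_descriptors(desc, top_k=100):
    """D2-Net-style detection sketch: the descriptors double as the detector.
    desc: (C, H, W) dense descriptor map from a CNN backbone."""
    C, H, W = desc.shape
    # channel-wise ratio-to-max selection: favor pixels that dominate a channel
    alpha = desc / (desc.max(axis=(1, 2), keepdims=True) + 1e-8)
    score = (alpha * desc).max(axis=0)           # (H, W) detection score
    # keep only 3x3 spatial local maxima (hard non-max suppression)
    pad = np.pad(score, 1, mode="constant", constant_values=-np.inf)
    neigh = np.stack([pad[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)])
    is_max = score >= neigh.max(axis=0)
    ys, xs = np.where(is_max)
    order = np.argsort(-score[ys, xs])[:top_k]
    return np.stack([xs[order], ys[order]], axis=1)

# toy map: one channel spikes at (x=7, y=5), so that pixel should rank first
desc = np.ones((4, 16, 16)) * 0.1
desc[0, 5, 7] = 10.0
kps = detect_from_descriptors(desc)  # kps[0] -> [7 5]
```

The memory cost is also visible here: the full (C, H, W) descriptor map must be materialized before any keypoint is selected, unlike the heatmap-first approach.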
Hierarchical (HLoc style)
Combine global retrieval network with local features.
Image → Global Net → Candidates → Local Features → Pose
- Advantages: Best accuracy, handles large databases
- Challenges: Multiple networks, complex pipeline
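The retrieval-then-match flow above can be sketched as follows. `localize` and its arguments are hypothetical names, and the final pose step (PnP + RANSAC over the matches) is omitted:

```python
import numpy as np

def localize(query_global, query_local, db_global, db_local, k=3):
    """Hierarchical localization sketch.
    1) Global retrieval: cosine similarity of whole-image descriptors picks
       the k most similar database images.
    2) Local matching: only those candidates get matched with local features,
       so matching cost stays flat as the map grows."""
    # step 1: shortlist by global descriptor similarity
    sims = db_global @ query_global / (
        np.linalg.norm(db_global, axis=1) * np.linalg.norm(query_global) + 1e-8)
    candidates = np.argsort(-sims)[:k]
    # step 2: mutual nearest-neighbor matching against each candidate
    matches = {}
    for idx in candidates:
        d = query_local @ db_local[idx].T       # descriptor similarity matrix
        fwd = d.argmax(axis=1)                  # query -> db
        bwd = d.argmax(axis=0)                  # db -> query
        matches[int(idx)] = [(q, int(fwd[q]))
                             for q in range(len(fwd)) if bwd[fwd[q]] == q]
    return candidates, matches

# toy database: image 2 is the obvious retrieval hit, local features identical
db_global = np.eye(4)
query_global = db_global[2]
query_local = np.eye(5, 8)
db_local = [np.eye(5, 8) for _ in range(4)]
candidates, matches = localize(query_global, query_local, db_global, db_local)
```

Mutual nearest-neighbor filtering is the simplest matcher; a learned matcher like SuperGlue would slot in at the same point in the pipeline.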
Training for VPS
Off-the-shelf models are trained on academic datasets. We need:
- More geographic diversity
- More condition diversity
- Quest-specific camera characteristics
Training data sources:
- Synthetic renders (full control, unlimited)
- Real captures from VPS mapping (authentic but limited)
- Public datasets (diversity but no control)
Training approach:
- Pre-train on large public data
- Fine-tune on VPS-specific data
- Domain adaptation for synthetic-to-real transfer
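The fine-tuning stage needs a correspondence-supervised descriptor loss. A margin loss with hardest-negative mining is a common choice for this; the numpy sketch below is a generic example, not our exact training objective:

```python
import numpy as np

def triplet_descriptor_loss(anchor, positive, negatives, margin=0.2):
    """Margin-based descriptor loss for fine-tuning on correspondence data.
    anchor/positive: (D,) descriptors of the same 3D point in two views.
    negatives: (N, D) descriptors of other points; the hardest one is used."""
    def dist(a, b):
        return np.linalg.norm(a - b, axis=-1)
    pos = dist(anchor, positive)
    hard_neg = dist(anchor, negatives).min()   # hardest-negative mining
    return max(0.0, margin + pos - hard_neg)

# matched descriptors with a distant negative -> loss is zero
a = np.array([1.0, 0.0])
easy = triplet_descriptor_loss(a, a, np.array([[0.0, 1.0]]))        # -> 0.0
# mismatched pair with a confusable negative -> loss is positive
hard = triplet_descriptor_loss(a, np.array([0.0, 1.0]),
                               np.array([[1.0, 0.1]]))
```

Correspondences for the positive pairs come from the VPS mapping pipeline for real captures, or for free from the renderer on synthetic data.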
Deployment Considerations
Learned features are more expensive to run than classical ones:
- Model size: 10MB+ (vs. minimal for ORB)
- Inference: 30-50ms on mobile (vs. 5ms for ORB)
- Memory: Feature maps consume GPU memory
Optimizations:
- Knowledge distillation to smaller student network
- INT8 quantization with minimal accuracy loss
- TensorRT/NNAPI deployment for hardware acceleration
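The distillation step can be framed as dense regression onto the teacher's outputs, so no ground-truth correspondences are needed at that stage. A sketch with an illustrative loss weighting:

```python
import numpy as np

def distillation_loss(student_desc, teacher_desc,
                      student_heat, teacher_heat, w=0.5):
    """Distillation objective sketch: the small student regresses the big
    teacher's dense descriptors and keypoint heatmap on unlabeled images.
    Descriptor maps are (C, H, W); heatmaps are (H, W); w is illustrative."""
    # cosine distance on L2-normalized descriptors, averaged over pixels
    s = student_desc / (np.linalg.norm(student_desc, axis=0, keepdims=True) + 1e-8)
    t = teacher_desc / (np.linalg.norm(teacher_desc, axis=0, keepdims=True) + 1e-8)
    desc_loss = 1.0 - (s * t).sum(axis=0).mean()
    # plain MSE on the detection heatmap
    heat_loss = np.mean((student_heat - teacher_heat) ** 2)
    return desc_loss + w * heat_loss

# sanity check: a student that exactly matches the teacher has ~zero loss
rng = np.random.default_rng(0)
d = rng.normal(size=(8, 4, 4))
h = rng.random(size=(4, 4))
zero = distillation_loss(d, d, h, h)
```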
Current student model: 2MB, 25ms on Quest.
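The INT8 step, in its simplest symmetric per-tensor form (a sketch of the general scheme, not our actual TensorRT calibration pipeline):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: map [-max|x|, max|x|] onto
    [-127, 127] with a single scale factor."""
    scale = max(np.abs(x).max() / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
desc = rng.standard_normal(256).astype(np.float32)
q, scale = quantize_int8(desc)
# round-off error is bounded by half a quantization step (scale / 2)
max_err = np.abs(dequantize_int8(q, scale) - desc).max()
```

Descriptors tolerate this well because matching depends on relative distances, which a uniform scale preserves; that is where the "minimal accuracy loss" comes from.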
A/B Testing Plan
Rolling out gradually:
- Internal dogfooding with learned features
- Small percentage external users
- Measure localization success rate
- Expand if metrics improve
Fallback: classical features remain available whenever the learned path fails.
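The fallback can be sketched as a thin wrapper around both extractors (`learned_fn` and `classical_fn` are hypothetical stand-ins for the real ones):

```python
def extract_features(image, in_treatment, learned_fn, classical_fn, min_kps=50):
    """A/B rollout fallback sketch: use the learned extractor for
    treatment-group users, and fall back to classical features if it
    errors out or returns a degenerate (too sparse) result."""
    if in_treatment:
        try:
            kps, desc = learned_fn(image)
            if len(kps) >= min_kps:
                return kps, desc, "learned"
        except Exception:
            pass  # any runtime failure falls through to the classical path
    kps, desc = classical_fn(image)
    return kps, desc, "classical"

# degenerate learned output (too few keypoints) triggers the fallback
learned = lambda img: ([(0, 0)] * 10, b"L")
classical = lambda img: ([(0, 0)] * 60, b"C")
_, _, source = extract_features(None, True, learned, classical)  # -> "classical"
```

Logging the `source` tag per localization attempt is also what lets the A/B analysis attribute success rates to the right extractor.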
Results So Far
Internal testing shows:
- 40% reduction in localization failures
- Better performance in challenging lighting
- Comparable latency after optimization
Planning broader rollout Q2 2021.