6DoF Localization: From Image to Pose
The technical pipeline for localizing a device using visual features: retrieval, matching, and pose estimation.
The core VPS query: "Given an image from a device, where is that device in the world?"
The Localization Pipeline
Query Image → Feature Extraction → Image Retrieval →
Feature Matching → Pose Estimation → Verification → 6DoF Pose
Each stage has multiple approaches. Here's what we're building.
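The stage chain above can be sketched as a simple composition of callables. Everything here is a hypothetical placeholder structure, not the production system; each real stage is described in the sections below.

```python
# Minimal sketch of the localization pipeline: each stage is a callable
# that consumes the previous stage's output. A stage returning None
# aborts the pipeline (e.g. no map coverage, verification failure).

def run_pipeline(image, stages):
    """Run a query image through the localization stages in order.

    Returns the final 6DoF pose, or None if any stage fails.
    """
    result = image
    for stage in stages:
        result = stage(result)
        if result is None:  # stage-level failure -> fall back (see Failure Modes)
            return None
    return result
```

In practice each stage also carries metadata (timings, confidence) forward; a bare value chain keeps the sketch readable.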
Feature Extraction
Convert image to compact representation for matching.
Classical approaches (SIFT, ORB):
- Hand-crafted descriptors
- Well-understood behavior
- Limited robustness to viewpoint/lighting change
Learned approaches (SuperPoint, D2-Net, R2D2):
- Neural network detects and describes keypoints
- More robust to condition changes
- Requires training data
We use learned features for robustness, with classical fallback for edge cases.
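To make the "image to compact representation" step concrete, here is a toy hand-crafted descriptor: mean-centred, L2-normalised patch intensities. This is illustrative only; real classical descriptors (SIFT, ORB) and learned ones (SuperPoint) are far more sophisticated, but the invariance idea is the same.

```python
import math

# Toy patch descriptor: mean subtraction gives brightness invariance,
# L2 normalisation gives contrast invariance. Real descriptors add
# gradient orientation histograms (SIFT) or learned embeddings.

def describe_patch(patch):
    """Turn a flat list of pixel intensities into a unit-length descriptor."""
    mean = sum(patch) / len(patch)
    centred = [p - mean for p in patch]
    norm = math.sqrt(sum(c * c for c in centred)) or 1.0  # guard flat patches
    return [c / norm for c in centred]
```

Adding a constant brightness offset to the patch leaves the descriptor unchanged, which is exactly the robustness property the pipeline needs at matching time.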
Image Retrieval
Find which part of the map database is relevant.
Global descriptors: Compress entire image to single vector (NetVLAD, GeM).
Query: Which database images are most similar to this query image?
At world scale: billions of database images. Need efficient nearest-neighbor search.
Solutions:
- Hierarchical coarse-to-fine search
- Learned hash codes for approximate NN
- Geographic pre-filtering using device GPS
Target: Top-100 candidates in under 100ms.
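A sketch of retrieval with a GPS pre-filter followed by descriptor similarity. The linear scan and the tuple-based database here are illustrative; at world scale this would be an approximate nearest-neighbor index over NetVLAD/GeM vectors, queried coarse-to-fine.

```python
import math

# Hedged sketch: top-k retrieval over global descriptors.
# `db` is a list of (descriptor, lat, lon) tuples -- a stand-in for the
# real indexed map database.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_desc, query_gps, db, radius_deg=0.01, k=100):
    """Return the k database entries most similar to the query,
    restricted to a bounding box around the device GPS fix."""
    qlat, qlon = query_gps
    nearby = [
        entry for entry in db
        if abs(entry[1] - qlat) <= radius_deg and abs(entry[2] - qlon) <= radius_deg
    ]
    nearby.sort(key=lambda e: cosine(query_desc, e[0]), reverse=True)
    return nearby[:k]
```

The GPS pre-filter is what makes billions of images tractable: it shrinks the candidate set by orders of magnitude before any descriptor comparison runs.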
Feature Matching
Match query features to 3D map features.
For each query keypoint:
- Find candidate matches in retrieved map regions
- Use descriptor distance for initial correspondence
- Apply ratio test (Lowe's ratio) to reject ambiguous matches
Challenges:
- False matches (similar descriptors, wrong location)
- Repeated structures (many buildings look alike)
- Viewpoint difference (query view may differ significantly from map)
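The ratio test in the steps above can be sketched directly. Descriptors are plain lists and distance is Euclidean; the 0.8 threshold follows Lowe's original recommendation and would be tuned per descriptor type.

```python
import math

# Lowe's ratio test: keep a match only if the best candidate is clearly
# closer than the second best. This rejects exactly the ambiguous cases
# listed above (repeated structures, similar-looking facades).

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_with_ratio_test(query_descs, map_descs, ratio=0.8):
    """Return (query_idx, map_idx) pairs that pass the ratio test."""
    matches = []
    for qi, q in enumerate(query_descs):
        dists = sorted((l2(q, m), mi) for mi, m in enumerate(map_descs))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches
```

Note what the test does on repeated structure: when two map descriptors are nearly equidistant from the query, neither wins, and the keypoint contributes no correspondence rather than a false one.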
Pose Estimation
Given 2D-3D correspondences, compute camera pose.
PnP (Perspective-n-Point): Classic approach
2D image points + 3D world points → 6DoF pose
With outliers: RANSAC + PnP
- Sample minimal sets of correspondences
- Compute pose from each sample
- Score by inlier count
- Refine using all inliers
Typical: 50-100 inliers needed for robust pose.
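The four RANSAC steps above map onto a short skeleton. The minimal solver (e.g. P3P) and the reprojection-error function are injected as hypothetical callables here rather than implemented, so the skeleton stays generic.

```python
import random

# RANSAC skeleton for robust pose estimation. `solve` fits a pose from a
# minimal correspondence set; `reproj_error` returns the pixel error of
# one correspondence under a pose. Both are caller-supplied stand-ins
# for real P3P / pinhole-projection implementations.

def ransac_pnp(correspondences, solve, reproj_error,
               min_set=3, iters=200, threshold_px=4.0, seed=0):
    """Return (best_pose, inliers), or (None, []) if nothing fits."""
    rng = random.Random(seed)
    best_pose, best_inliers = None, []
    for _ in range(iters):
        sample = rng.sample(correspondences, min_set)   # 1. minimal sample
        pose = solve(sample)                            # 2. candidate pose
        if pose is None:
            continue
        inliers = [c for c in correspondences           # 3. score by inliers
                   if reproj_error(pose, c) < threshold_px]
        if len(inliers) > len(best_inliers):
            best_pose, best_inliers = pose, inliers
    # 4. production code refines best_pose on all inliers
    #    (e.g. nonlinear least squares); omitted in this sketch.
    return best_pose, best_inliers
```

The same skeleton works for any model-fitting problem; for PnP the minimal set is 3-4 correspondences, and the 50-100 inlier target above is what the scoring step should end up with for a trustworthy pose.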
Verification
Not every pose estimate is correct. Verify before trusting.
Checks:
- Inlier ratio: Percentage of matches consistent with pose
- Geometric consistency: Reprojection error distribution
- Temporal consistency: Pose plausible given previous poses
- Semantic consistency: Scene content matches expectation
Confidence score determines whether to use pose or fall back to GPS.
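The checks above combine into a single accept/reject gate. The thresholds below are illustrative placeholders, not values from the source; a production system would tune or learn them.

```python
# Verification gate: combine the consistency checks into one decision.
# All thresholds are illustrative assumptions.

def accept_pose(inlier_ratio, mean_reproj_px, jump_m,
                min_ratio=0.3, max_reproj_px=3.0, max_jump_m=5.0):
    """Decide whether to trust a pose estimate.

    - inlier_ratio:   fraction of matches consistent with the pose
    - mean_reproj_px: mean reprojection error of the inliers, in pixels
    - jump_m:         distance from the previous accepted pose, in metres
    Returns True to use the VPS pose, False to fall back to GPS.
    """
    return (inlier_ratio >= min_ratio
            and mean_reproj_px <= max_reproj_px
            and jump_m <= max_jump_m)
```

A real gate would output a continuous confidence score rather than a boolean, so downstream consumers can blend VPS and GPS instead of switching hard.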
Latency Budget
End-to-end target: under 500ms from image capture to pose.
| Stage | Target |
|---|---|
| Feature extraction | 50ms |
| Image retrieval | 100ms |
| Feature matching | 150ms |
| Pose estimation | 50ms |
| Verification | 50ms |
On-device optimization is critical for feature extraction; retrieval is optimized on the cloud side.
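A quick arithmetic check on the table: the per-stage targets sum to 400ms, which leaves roughly 100ms of headroom within the 500ms end-to-end target, presumably absorbed by capture, network transfer, and scheduling overhead (the source does not break this down).

```python
# Stage budgets from the table above, in milliseconds.
BUDGET_MS = {
    "feature_extraction": 50,
    "image_retrieval": 100,
    "feature_matching": 150,
    "pose_estimation": 50,
    "verification": 50,
}

total_ms = sum(BUDGET_MS.values())   # 400
margin_ms = 500 - total_ms           # headroom within the end-to-end target
```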
Failure Modes
VPS fails gracefully:
- No map coverage → Fall back to GPS
- Low confidence → Indicate uncertainty
- Repeated structures → Request more images
- Changed scene → Indicate map may be stale
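The fallback list above reads naturally as a dispatch function. The outcome names, check ordering, and confidence threshold here are illustrative assumptions, not the production policy.

```python
# Sketch of the graceful-failure policy. Flags would come from earlier
# pipeline stages; the 0.7 threshold is an assumed placeholder.

def handle_result(has_coverage, confidence, repeated_structure,
                  scene_changed, threshold=0.7):
    """Map a localization outcome to a user-facing behaviour."""
    if not has_coverage:
        return "fallback_gps"
    if scene_changed:
        return "warn_stale_map"
    if repeated_structure:
        return "request_more_images"
    if confidence < threshold:
        return "indicate_uncertainty"
    return "use_vps_pose"
```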
Users should trust VPS when it's confident, and know when it's not.