VOCAL: Visual Odometry via ContrAstive Learning

Chi-Yao Huang; Zeel Bhatt; Yezhou Yang

arXiv:2507.00243 · cs.CV · January 26, 2026

TL;DR

VOCAL is a contrastive learning framework for visual odometry (VO) that frames VO as a label ranking problem and integrates Bayesian inference, with the aim of improving interpretability and robustness.

Contribution

VOCAL redefines visual odometry as a label ranking task learned contrastively, which improves interpretability and compatibility with multimodal data while remaining fully data-driven.

Findings

01 Outperforms existing methods on the KITTI dataset

02 Improves interpretability of visual odometry models

03 Enhances flexibility with multimodal data

Abstract

Breakthroughs in visual odometry (VO) have fundamentally reshaped the landscape of robotics, enabling ultra-precise camera state estimation that is crucial for modern autonomous systems. Despite these advances, many learning-based VO techniques rely on rigid geometric assumptions, which often fall short in interpretability and lack a solid theoretical basis within fully data-driven frameworks. To overcome these limitations, we introduce VOCAL (Visual Odometry via ContrAstive Learning), a novel framework that reimagines VO as a label ranking challenge. By integrating Bayesian inference with a representation learning framework, VOCAL organizes visual features to mirror camera states. The ranking mechanism compels similar camera states to converge into consistent and spatially coherent representations within the latent space. This strategic alignment not only bolsters the interpretability…
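The ranking idea in the abstract can be illustrated with a small sketch. The abstract does not give VOCAL's loss, so the code below swaps in a generic rank-based contrastive objective in the style of Rank-N-Contrast: for an anchor i and another sample j, every sample whose camera-state label is at least as far from the anchor as j's is treated as a negative for the pair (i, j). The function names, the Euclidean label distance, the negative-distance similarity, and the temperature are all assumptions made for illustration, not details taken from the paper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_contrastive_loss(embeddings, labels, temperature=1.0):
    """Illustrative rank-based contrastive loss (NOT the paper's objective).

    embeddings: list of latent vectors, one per frame pair (hypothetical).
    labels:     list of camera-state vectors, e.g. relative pose (hypothetical).
    """
    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        # Similarity to the anchor: negative latent distance, scaled by temperature.
        sim = [-l2_distance(embeddings[i], embeddings[k]) / temperature
               for k in range(n)]
        # Distance between camera-state labels, used only to rank samples.
        lab = [l2_distance(labels[i], labels[k]) for k in range(n)]
        for j in range(n):
            if j == i:
                continue
            # Denominator: samples (excluding the anchor) whose label is at
            # least as far from the anchor as sample j's label.
            denom = sum(math.exp(sim[k]) for k in range(n)
                        if k != i and lab[k] >= lab[j])
            total += -(sim[j] - math.log(denom))
            count += 1
    return total / count


if __name__ == "__main__":
    states = [[0.0], [1.0], [2.0], [3.0]]
    aligned = [[0.0], [1.0], [2.0], [3.0]]    # latent order matches label order
    scrambled = [[0.0], [3.0], [1.0], [2.0]]  # latent order ignores labels
    print(rank_contrastive_loss(aligned, states))
    print(rank_contrastive_loss(scrambled, states))
```

Under these assumptions, embeddings whose latent ordering follows the ordering of camera states receive a lower loss than embeddings that ignore it, which is the behavior the abstract describes as similar camera states converging into coherent representations.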

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Multimodal Machine Learning Applications