XVO: Generalized Visual Odometry via Cross-Modal Self-Training

Lei Lai; Zhongkai Shangguan; Jimuyang Zhang; Eshed Ohn-Bar

arXiv:2309.16772 · cs.CV · October 10, 2023

TL;DR

XVO introduces a semi-supervised, multi-modal approach to monocular visual odometry that learns from diverse, real-world dashcam videos without relying on camera calibration, achieving robust, generalized pose estimation.

Contribution

The paper presents a novel semi-supervised learning framework with multi-modal supervision for generalized monocular visual odometry, enabling off-the-shelf performance across datasets.

Findings

01. Achieves state-of-the-art results on KITTI without multi-frame optimization.

02. Effectively transfers knowledge across diverse datasets without fine-tuning.

03. Audio auxiliary tasks significantly improve learning in dynamic, out-of-domain videos.

Abstract

We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to…
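The self-training idea named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a scalar least-squares line stands in for the VO regression network, and the names `fit_line`, `self_train`, and `multitask_loss` are hypothetical. The weighted-sum form of the auxiliary loss is also an assumption, since the abstract lists the auxiliary tasks (segmentation, flow, depth, audio) but not how they are combined.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Toy stand-in for (frame pair, relative pose): one scalar feature, one scalar target.
Sample = Tuple[float, float]


def fit_line(samples: Sequence[Sample]) -> Callable[[float], float]:
    """Closed-form least-squares fit of y = a*x + b (stand-in for training a network)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    a = cov_xy / var_x if var_x > 0 else 0.0
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def self_train(labeled: Sequence[Sample], unlabeled: Sequence[float]) -> Callable[[float], float]:
    """Generic self-training: teacher on labeled data, pseudo-label the rest, retrain a student."""
    teacher = fit_line(labeled)
    pseudo: List[Sample] = [(x, teacher(x)) for x in unlabeled]  # pseudo-labels from the teacher
    return fit_line(list(labeled) + pseudo)  # student sees labeled + pseudo-labeled samples


def multitask_loss(pose_loss: float,
                   aux_losses: Dict[str, float],
                   aux_weights: Dict[str, float]) -> float:
    """Pose loss plus weighted auxiliary losses (the weighting scheme is an assumption)."""
    return pose_loss + sum(aux_weights.get(name, 0.0) * value
                           for name, value in aux_losses.items())


if __name__ == "__main__":
    labeled = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # made-up labeled samples
    unlabeled = [3.0, 4.0]                          # made-up unlabeled inputs
    student = self_train(labeled, unlabeled)
    print(student(10.0))
    print(multitask_loss(1.0, {"depth": 2.0, "audio": 4.0}, {"depth": 0.5, "audio": 0.25}))
```

In this linear toy the student simply reproduces the teacher, because the pseudo-labels lie exactly on the teacher's line. The sketch shows only the data flow: labeled data trains a teacher, the teacher labels the unlabeled pool, and the student trains on the union, optionally with extra auxiliary loss terms.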

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Image Processing Techniques and Applications