XVO: Generalized Visual Odometry via Cross-Modal Self-Training
Lei Lai, Zhongkai Shangguan, Jimuyang Zhang, Eshed Ohn-Bar

TL;DR
XVO introduces a semi-supervised, multi-modal approach to monocular visual odometry that learns from diverse, real-world dashcam videos without relying on camera calibration, achieving robust, generalized pose estimation.
Contribution
The paper presents a novel semi-supervised learning framework with multi-modal supervision for generalized monocular visual odometry, enabling off-the-shelf operation across datasets without fine-tuning.
Findings
Achieves state-of-the-art results on KITTI without multi-frame optimization.
Effectively transfers knowledge across diverse datasets without fine-tuning.
Audio auxiliary tasks significantly improve learning in dynamic, out-of-domain videos.
Abstract
We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to…
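The self-training idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a one-dimensional least-squares line stands in for the VO regression network, and the feature (a hypothetical per-frame motion cue) and target (forward translation) are assumptions chosen only for illustration. A teacher is fit on labeled pairs, it pseudo-labels the unlabeled inputs, and a student is fit on the union.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def predict(model, xs):
    """Apply a fitted (slope, intercept) model to a list of inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def self_train(labeled_x, labeled_y, unlabeled_x):
    """One round of teacher-student self-training (illustrative only)."""
    # Step 1: train the teacher on the small labeled set.
    teacher = fit_line(labeled_x, labeled_y)
    # Step 2: let the teacher pseudo-label the unlabeled inputs.
    pseudo_y = predict(teacher, unlabeled_x)
    # Step 3: train the student on labeled plus pseudo-labeled data.
    student = fit_line(labeled_x + unlabeled_x, labeled_y + pseudo_y)
    return teacher, student, pseudo_y


# Hypothetical toy data: motion cue -> forward translation.
labeled_x = [0.0, 1.0, 2.0, 3.0]
labeled_y = [0.1, 0.9, 2.1, 2.9]
unlabeled_x = [0.5, 1.5, 2.5, 4.0, 5.0]

teacher, student, pseudo_y = self_train(labeled_x, labeled_y, unlabeled_x)
```

In this linear toy the student reproduces the teacher, because the pseudo-labels lie exactly on the teacher's line. Any gain from self-training in practice has to come from elements the sketch omits, such as a non-linear network, input perturbation, or the auxiliary prediction tasks the abstract lists.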
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Image Processing Techniques and Applications
