RoMeO: Robust Metric Visual Odometry
Junda Cheng, Zhipeng Cai, Zhaoxing Zhang, Wei Yin, Matthias Muller, Michael Paulitsch, Xin Yang

TL;DR
RoMeO introduces a robust monocular visual odometry method that leverages pre-trained depth priors to improve accuracy, robustness, and metric-scale recovery across diverse indoor and outdoor datasets.
Contribution
The paper presents RoMeO, a novel monocular VO approach that integrates depth priors and noise filtering to enhance robustness and generalization, outperforming existing methods significantly.
Findings
RoMeO reduces trajectory errors by over 50% compared to prior state-of-the-art (SOTA) methods.
It generalizes well to diverse indoor and outdoor scenes.
Performance improvements extend to full SLAM pipelines.
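As a rough illustration of what a "50% reduction in trajectory error" means, the sketch below computes a root-mean-square translational error between an estimated and a ground-truth trajectory and then the relative reduction between two methods. It assumes the error is measured directly in metric units with no scale alignment; the paper's exact evaluation protocol is not given in this summary. The function names and the toy trajectories are hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMS translational error between two equally long lists of (x, y, z) positions.

    Assumption: both trajectories are already in the same metric frame,
    so no scale or rigid alignment is applied here.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Toy data (hypothetical): a straight-line ground truth and two drifting estimates.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
baseline = [(0.0, 0.0, 0.0), (1.0, 0.4, 0.0), (2.0, 0.8, 0.0)]
improved = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.2, 0.0)]

reduction = relative_reduction(
    ate_rmse(baseline, ground_truth), ate_rmse(improved, ground_truth)
)
```

In this toy case the improved estimate drifts a quarter as far as the baseline, so the relative reduction is 0.75.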
Abstract
Visual odometry (VO) aims to estimate camera poses from visual inputs -- a fundamental building block for many applications such as VR/AR and robotics. This work focuses on monocular RGB VO where the input is a monocular RGB video without IMU or 3D sensors. Existing approaches lack robustness under this challenging scenario and fail to generalize to unseen data (especially outdoors); they also cannot recover metric-scale poses. We propose Robust Metric Visual Odometry (RoMeO), a novel method that resolves these issues by leveraging priors from pre-trained depth models. RoMeO incorporates both monocular metric depth and multi-view stereo (MVS) models to recover metric scale, simplify correspondence search, provide better initialization and regularize optimization. Effective strategies are proposed to inject noise during training and adaptively filter noisy depth priors, which ensure the…
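The idea of using a metric depth prior to fix scale while discarding unreliable prior values can be sketched in a few lines. This is not RoMeO's actual procedure, which the abstract describes only at a high level; it is a minimal stand-in that assumes per-pixel up-to-scale depths from a VO system and matching metric depths from a pre-trained model, estimates a single scalar scale as a median ratio, and filters priors whose ratio deviates strongly from that median. The function name, the threshold `max_log_deviation`, and the sample values are all hypothetical.

```python
import math
import statistics


def estimate_metric_scale(relative_depths, metric_prior_depths, max_log_deviation=0.3):
    """Estimate one scalar scale mapping up-to-scale depths to metric depths.

    Assumption: the two lists hold depths for the same pixels.
    Returns (scale, inlier_fraction).
    """
    # Per-pixel ratio between the metric prior and the up-to-scale depth.
    ratios = [
        m / r
        for r, m in zip(relative_depths, metric_prior_depths)
        if r > 0 and m > 0
    ]
    if not ratios:
        raise ValueError("no valid depth pairs")
    median_ratio = statistics.median(ratios)
    # Drop priors whose ratio is far from the median in log space (likely noisy).
    kept = [q for q in ratios if abs(math.log(q / median_ratio)) <= max_log_deviation]
    return statistics.median(kept), len(kept) / len(ratios)


# Toy data (hypothetical): the fourth prior value is a gross outlier.
relative = [1.0, 2.0, 4.0, 5.0, 3.0]
prior = [2.0, 4.1, 7.9, 30.0, 6.0]
scale, inlier_fraction = estimate_metric_scale(relative, prior)
```

Here the outlier ratio of 6.0 is rejected and the remaining four ratios give a scale of 2.0 with an inlier fraction of 0.8. A median is used because it tolerates a minority of bad prior values without any tuning beyond the rejection threshold.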
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The method seems to make reasonable use of monocular depth estimates to improve visual odometry (although the results are less clear-cut because they are contaminated by overfitting to KITTI; see below).
Main issue: Several of the comparisons are problematic. The teaser compares to a method that does not estimate absolute scale (Fig. 1(b), right side), and the proposed method is explicitly overfitted to KITTI data (uses a depth method which was explicitly overfitted to KITTI, which is particularly problematic for a driving setting with a fixed camera-road configuration throughout the whole dataset which allows almost trivially perfect scale estimates, especially for the road in front of the vehi
+ The proposed method leads to improved accuracy of the visual odometry, especially in outdoor scenes where DROID-SLAM performs badly.
+ The adopted strategies bring improvement compared to the naive DROID-Metric3D baseline.
+ Thorough experiments are conducted on diverse datasets, and the proposed method achieves the best results consistently.
The major concern is that the paper in its current form offers little advance in knowledge beyond better results. I don't doubt its efficacy on these evaluated datasets, but the straightforward solution requires better analysis to uncover the major issues regarding leveraging a pre-trained depth estimation model:
+ What are the typical failure modes on diverse datasets with a naive DROID-Metric3D baseline?
+ Is there any domain gap between different datasets?
+ What are the differences betwee
* The experimental results in this paper are thorough. The authors use the TartanAir dataset for pre-training and conduct evaluations on three indoor and three outdoor datasets. The results are consistent and achieve state-of-the-art (SOTA) accuracy on most datasets. Notably, the proposed method shows substantial improvements in outdoor datasets, which strongly supports the effectiveness of the approach.
* The results of this paper have practical implications. Theoretically, monocular RGB VO is
* The paper's motivation is questionable. Although the results are insightful, the motivation is not sufficiently compelling. The authors claim that introducing pre-trained monocular depth and multi-view stereo neural networks helps recover scene scale. However, the scale of each frame is merely a scalar, and using dense depth priors for this purpose significantly increases the system's computational load (with system efficiency below 30 FPS, making it non-real-time). A more elegant, well-design
1. The paper shows superior camera tracking performance against learning-free and learning-based VO methods. As part of the SLAM system, it achieves the best performance compared to DROID-SLAM and ORB-SLAM3 approaches.
2. The paper is well-written, the experiments are extensive, and the superiority of the proposed system in camera tracking is demonstrated across several benchmarks.
3. Analysis of different depth estimators provides interesting insights into the right supervision for visual odome
1. The pipeline flow is not completely clear from the method description and Figure 2. It would be beneficial to separate the trained or frozen components (Depth, Flow, MVS Networks) from the test-time optimization (differentiable BA, sliding window optimization) in the figure and the text. A figure for the Depth-guided BA (Section 3.1) seems necessary to understand the optimization procedure.
2. No demo is provided to demonstrate the proposed system's pipeline step-by-step and performance.
1) The proposed method demonstrates superior performance compared to state-of-the-art methods across various benchmarks.
1) The technical contribution and insights appear limited. The integration of additional depth estimation models to enhance the performance of monocular VO or SLAM systems has become a well-recognized and practical solution. The primary distinction of this paper compared to previous literature is the incorporation of both monocular depth estimation models and multi-view stereo (MVS) models.
2) The paper presents a less rigorous concept that could potentially mislead the community. The authors cl
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Image and Object Detection Techniques
