TL;DR
PoseFM introduces a novel generative framework for monocular visual odometry using flow matching, modeling camera motion as a distribution for improved robustness and uncertainty estimation.
Contribution
It reformulates monocular VO as a generative task with flow matching, enabling uncertainty modeling and robust motion inference in challenging conditions.
Findings
Achieves competitive absolute trajectory error (ATE) on the TartanAir, KITTI, and TUM-RGBD benchmarks.
Attains the lowest ATE among monocular VO methods on some trajectories.
Code and models are available at the specified GitHub repository.
Abstract
Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown superior accuracy compared to traditional geometric pipelines, particularly in environments where handcrafted features struggle due to poor structure or lighting conditions, most rely on deterministic regression, which lacks the uncertainty awareness required for robust applications. We propose PoseFM, the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference…
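The idea of transforming noise into pose predictions by integrating a continuous-time ODE can be illustrated with a minimal sketch. This is not the PoseFM implementation: `toy_velocity` is a hypothetical closed-form stand-in for the learned, image-conditioned velocity network, and the 6-D pose vector (three translation and three rotation components), the values of `mu` and `s`, and the use of plain Euler integration are all assumptions chosen for illustration.

```python
import numpy as np


def toy_velocity(x, t, mu, s):
    """Hypothetical stand-in for a learned velocity network.

    Closed-form marginal velocity of the linear path x_t = (1 - t) * x0 + t * x1
    with x0 ~ N(0, I) and x1 ~ N(mu, s^2 I). A real model would instead be a
    neural network conditioned on the image pair.
    """
    var_t = (1.0 - t) ** 2 + (t * s) ** 2  # variance of x_t, positive for s > 0
    return mu + (t * s**2 - (1.0 - t)) / var_t * (x - t * mu)


def sample_poses(mu, s, n_samples=2000, n_steps=200, seed=0):
    """Draw pose samples by Euler-integrating dx/dt = v(x, t) from t=0 to t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, mu.shape[0]))  # start from pure noise
    h = 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * toy_velocity(x, k * h, mu, s)
    return x


# Assumed relative pose: [tx, ty, tz, rx, ry, rz] (illustrative values only).
mu = np.array([0.5, 0.0, 0.1, 0.0, 0.02, 0.0])
samples = sample_poses(mu, s=0.1)

pose_estimate = samples.mean(axis=0)  # point estimate from the sample mean
pose_std = samples.std(axis=0)        # per-component spread as an uncertainty proxy
```

Drawing many samples and summarizing their spread is one simple way to read an uncertainty estimate off a generative pose model: a tighter spread suggests a more confident prediction, while a wide spread flags frames where the motion is ambiguous.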
