DINO-VO: Learning Where to Focus for Enhanced State Estimation

Qi Chen; Guanghao Li; Sijia Hu; Xin Gao; Junpeng Ma; Xiangyang Xue; Jian Pu

arXiv:2604.04055 · cs.CV · April 7, 2026


TL;DR

DINO-VO is an end-to-end monocular visual odometry system that uses adaptive patch selection and multi-task learning to improve accuracy and generalization across diverse environments.

Contribution

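The paper introduces a differentiable adaptive patch selector and couples a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that uses inverse depth priors, with the aim of improving robustness and generalization in visual odometry (VO).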
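
This summary does not say how the patch selector is made differentiable. As a rough illustration only, one generic way to make a "where to look" choice differentiable is a soft-argmax: take a softmax over a per-location score map and return the expected coordinate rather than a hard argmax. The sketch below assumes that technique; the function name `soft_patch_center`, the score grid, and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math


def soft_patch_center(scores, temperature=1.0):
    """Softmax-weighted expected (row, col) over a 2D score grid.

    Hypothetical helper for illustration; not the DINO-VO selector.
    A lower temperature pushes the result toward the hard argmax.
    """
    width = len(scores[0])
    # Flatten the grid and scale by temperature.
    flat = [s / temperature for row in scores for s in row]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(flat)
    weights = [math.exp(v - peak) for v in flat]
    total = sum(weights)
    # The expected coordinates are smooth functions of the scores.
    row = sum(w * (i // width) for i, w in enumerate(weights)) / total
    col = sum(w * (i % width) for i, w in enumerate(weights)) / total
    return row, col


if __name__ == "__main__":
    grid = [
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 10.0],
        [0.0, 0.0, 0.0],
    ]
    print(soft_patch_center(grid, temperature=0.1))
```

Because the output varies smoothly with the scores, a downstream loss (for example a pose error) could in principle pass gradients back to whatever network produced the score map. A real system would select many patches per frame and would do this in an autodiff framework rather than plain Python.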

Findings

1. Achieves state-of-the-art tracking accuracy on multiple datasets.

2. Demonstrates strong generalization across synthetic, indoor, and outdoor environments.

3. Outperforms existing VO systems in accuracy and robustness.

Abstract

We present DINO Patch Visual Odometry (DINO-VO), an end-to-end monocular visual odometry system with strong scene generalization. Current Visual Odometry (VO) systems often rely on heuristic feature extraction strategies, which can degrade accuracy and robustness, particularly in large-scale outdoor environments. DINO-VO addresses these limitations by incorporating a differentiable adaptive patch selector into the end-to-end pipeline, improving the quality of extracted patches and enhancing generalization across diverse datasets. Additionally, our system integrates a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors, enabling the system to learn and utilize appearance and geometric information effectively. This integration bridges the gap between feature learning and state estimation. Extensive experiments on the…
