DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model

Maulana Bisyir Azhari; David Hyunchul Shim

arXiv:2507.13145·cs.CV·July 18, 2025

Open Access

TL;DR

DINO-VO is a feature-based visual odometry system built on DINOv2 foundation model features. It combines DINOv2's semantic features with fine-grained geometric features for robust, accurate, and efficient camera motion estimation across diverse environments.

Contribution

The paper presents DINO-VO, a VO system that integrates DINOv2 features with a keypoint detector tailored to their coarse granularity and a transformer-based matcher, improving robustness and accuracy over prior methods.

Findings

01. Outperforms prior VO methods on the TartanAir and KITTI datasets.

02. Achieves real-time processing at 72 FPS with low memory usage.

03. Demonstrates strong generalization and competitive performance with SLAM systems.

Abstract

Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system that leverages the DINOv2 visual foundation model for sparse feature matching. To address the integration challenge, we propose a salient keypoint detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and a differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior…
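
To make the "coarse feature granularity" point concrete: DINOv2's ViT backbones use 14-pixel patches, so a 224×224 image yields only a 16×16 grid of feature vectors. The sketch below shows one plausible way a detector could be aligned to that grid, by keeping at most one keypoint per patch cell. It is an illustration only, not the authors' detector: the function name `grid_keypoints`, the use of image gradient magnitude as the saliency score, and the `top_k` cutoff are all assumptions made for this example.

```python
import numpy as np

PATCH = 14  # DINOv2 ViT patch size in pixels


def grid_keypoints(gray: np.ndarray, patch: int = PATCH, top_k: int = 512) -> np.ndarray:
    """Return up to top_k (x, y) keypoints, at most one per patch cell.

    Hypothetical helper: saliency is plain gradient magnitude, which is an
    assumption for illustration and not taken from the paper.
    """
    h, w = gray.shape
    gh, gw = h // patch, w // patch  # size of the coarse feature grid

    # Saliency proxy: gradient magnitude, cropped to a whole number of cells.
    gy, gx = np.gradient(gray.astype(np.float64))
    sal = np.hypot(gx, gy)[: gh * patch, : gw * patch]

    # Rearrange to (gh, gw, patch * patch) so each cell's pixels are contiguous.
    cells = sal.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3).reshape(gh, gw, -1)

    best = cells.argmax(axis=-1)   # flat index of the strongest pixel per cell
    score = cells.max(axis=-1)     # its saliency value
    dy, dx = np.divmod(best, patch)

    # Convert per-cell offsets back to full-image pixel coordinates.
    ys = np.arange(gh)[:, None] * patch + dy
    xs = np.arange(gw)[None, :] * patch + dx

    # Keep the top_k most salient cells.
    order = np.argsort(score.ravel())[::-1][:top_k]
    return np.stack([xs.ravel()[order], ys.ravel()[order]], axis=1)
```

Because every keypoint falls inside exactly one patch cell, each one can be paired with that cell's DINOv2 feature vector, which is the kind of alignment a detector "tailored to coarse features" would need before sparse matching.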

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization