Self-localization on a 3D map by fusing global and local features from a monocular camera

Satoshi Kikuchi; Masaya Kato; Tsuyoshi Tasaki

arXiv:2510.26170·cs.RO·December 19, 2025

TL;DR

This paper introduces a self-localization method that combines a CNN with a Vision Transformer to improve monocular-camera localization accuracy in environments with dynamic obstacles, and reports lower error than the state-of-the-art method.

Contribution

The study proposes fusing a CNN, which extracts local features from nearby pixels, with a Vision Transformer, which extracts global features relating patches across the whole image, for monocular camera-based self-localization on a 3D map.

Findings

01

Accuracy improvement rate over SOTA on a CG dataset is 1.5 times higher with dynamic obstacles than without

02

Self-localization error is 20.1% smaller than that of SOTA on public datasets

03

Robot localization error averaged 7.51 cm, lower than that of previous methods
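
To make the "20.1% smaller" figure concrete, here is a minimal sketch of how a relative error reduction is usually computed. The function name and the two error values are hypothetical placeholders chosen only so the arithmetic is easy to follow; they are not measurements from the paper.

```python
def relative_error_reduction(baseline_error, method_error):
    """Fraction by which method_error is smaller than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - method_error) / baseline_error


# Hypothetical, unitless error values (NOT taken from the paper).
sota_error = 1.000
our_error = 0.799

reduction = relative_error_reduction(sota_error, our_error)
print(f"{reduction:.1%}")  # prints "20.1%"
```

Under this definition, a 20.1% reduction means the proposed method's error is about 0.799 times the baseline error.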

Abstract

Self-localization on a 3D map using an inexpensive monocular camera is required to realize autonomous driving. Camera-based self-localization often uses a convolutional neural network (CNN), which extracts local features computed from nearby pixels. However, CNNs do not work well when dynamic obstacles, such as people, are present. This study proposes a new method that combines a CNN with a Vision Transformer, which excels at extracting global features that capture the relationships among patches across the whole image. Experimental results showed that, compared to the state-of-the-art method (SOTA), the accuracy improvement rate on a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself…
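
The distinction the abstract draws between local and global features can be illustrated with a toy sketch. This is not the authors' architecture: the 3x3 mean filter stands in for a CNN layer, the unparameterized single-head self-attention stands in for a Vision Transformer block, and fusion by concatenation is an assumption, since this page does not say how the two branches are combined. All function names are hypothetical.

```python
import math


def conv3x3_mean(img):
    """Local branch: each output pixel depends only on its 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def patchify(img, p):
    """Split an image into non-overlapping p x p patches, each flattened."""
    h, w = len(img), len(img[0])
    return [[img[y + j][x + i] for j in range(p) for i in range(p)]
            for y in range(0, h, p) for x in range(0, w, p)]


def self_attention(tokens):
    """Global branch: each output token is a softmax-weighted mix of ALL tokens."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(wt / z * v[c] for wt, v in zip(weights, tokens))
                    for c in range(d)])
    return out


def fuse(img, p=2):
    """Concatenate per-patch local and global features (assumed fusion rule)."""
    local = patchify(conv3x3_mean(img), p)
    global_ = self_attention(patchify(img, p))
    return [l + g for l, g in zip(local, global_)]


# A 4x4 toy image, and a copy with one far-corner pixel changed.
img = [[0.1 * (y * 4 + x) for x in range(4)] for y in range(4)]
img_changed = [row[:] for row in img]
img_changed[3][3] = 5.0

f_a, f_b = fuse(img), fuse(img_changed)
print(f_a[0][:4] == f_b[0][:4])  # local half of the top-left patch
print(f_a[0][4:] == f_b[0][4:])  # global half of the top-left patch
```

In this toy setting, changing the bottom-right pixel leaves the local half of the top-left patch's feature untouched, while the global half changes because attention mixes information from every patch. The image height and width must be divisible by the patch size `p`.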
