Self-localization on a 3D map by fusing global and local features from a monocular camera
Satoshi Kikuchi, Masaya Kato, Tsuyoshi Tasaki

TL;DR
This paper introduces a self-localization method that combines a CNN with a Vision Transformer to improve accuracy in dynamic environments using a monocular camera, outperforming state-of-the-art approaches.
Contribution
The study proposes fusing a CNN, which extracts local features, with a Vision Transformer, which extracts global features, for monocular camera-based self-localization on a 3D map.
Findings
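Relative to SOTA, the accuracy improvement rate on a CG dataset with dynamic obstacles is 1.5 times higher than on one without dynamic obstacles
Self-localization error on public datasets is 20.1% smaller than that of SOTA
A robot using the method localized itself with an average error of 7.51 cm, more accurate than previous methods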
Abstract
Self-localization on a 3D map using an inexpensive monocular camera is required to realize autonomous driving. Camera-based self-localization often uses a convolutional neural network (CNN), which extracts local features computed from nearby pixels. However, when dynamic obstacles, such as people, are present, CNNs do not work well. This study proposes a new method that combines a CNN with a Vision Transformer, which excels at extracting global features that capture the relationships between patches across the whole image. Experimental results showed that, compared to the state-of-the-art method (SOTA), the accuracy improvement rate on a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself…
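The distinction the abstract draws between local features (computed from nearby pixels) and global features (relationships between patches across the whole image) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration, not the authors' architecture: the function names, the random weights, the single attention head, fusion by concatenation, and the pose output as a translation plus a unit quaternion are all hypothetical choices, since this summary does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_branch(image, kernels):
    """CNN-style features: each response depends only on a small pixel neighbourhood."""
    # windows has shape (H-2, W-2, 3, 3) for 3x3 kernels
    windows = np.lib.stride_tricks.sliding_window_view(image, kernels.shape[1:])
    maps = np.einsum("ijkl,nkl->nij", windows, kernels)  # one map per kernel
    return np.maximum(maps, 0.0).mean(axis=(1, 2))       # ReLU, then average pooling

def global_branch(image, patch, w_embed, w_q, w_k, w_v):
    """ViT-style features: self-attention relates every patch to every other patch."""
    h, w = image.shape
    # Split into non-overlapping patches and flatten each: (n_patches, patch * patch)
    patches = (image.reshape(h // patch, patch, w // patch, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, patch * patch))
    tokens = patches @ w_embed
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)           # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return (attn @ v).mean(axis=0), attn

def fuse_and_regress(local_vec, global_vec, w_head):
    """Hypothetical fusion: concatenate both feature vectors, then a linear pose head."""
    fused = np.concatenate([local_vec, global_vec])
    pose = fused @ w_head
    translation, quaternion = pose[:3], pose[3:]
    return translation, quaternion / np.linalg.norm(quaternion)

# Illustrative sizes and random (untrained) weights, all assumed.
image = rng.random((32, 32))
kernels = rng.standard_normal((8, 3, 3))
d = 16
w_embed = rng.standard_normal((64, d)) * 0.1
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
w_head = rng.standard_normal((8 + d, 7)) * 0.1

local_vec = local_branch(image, kernels)
global_vec, attn = global_branch(image, 8, w_embed, w_q, w_k, w_v)
translation, quaternion = fuse_and_regress(local_vec, global_vec, w_head)
```

In this sketch, each entry of `local_vec` summarizes 3x3 neighbourhoods only, whereas each row of `attn` weights all 16 patches, so a patch occluded by a moving person can still draw on evidence from the rest of the image. Whether the paper fuses the two branches by concatenation or by another mechanism is not stated in this summary.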
