Robust Visual Localization via Semantic-Guided Multi-Scale Transformer

Zhongtao Tian; Wenhao Huang; Zhidong Chen; Xiao Wei Sun

arXiv:2506.08526 · cs.CV · June 11, 2025

Open Access

TL;DR

This paper introduces a novel semantic-guided multi-scale Transformer framework that significantly improves visual localization accuracy in dynamic environments by combining hierarchical feature learning with semantic scene understanding.

Contribution

It integrates a multi-scale Transformer architecture with semantic supervision to make visual localization more robust under challenging conditions.

Findings

01  Outperforms existing pose regression methods in dynamic scenarios

02  Handles lighting changes, occlusions, and moving objects effectively

03  Demonstrates robustness on the TartanAir dataset

Abstract

Visual localization remains challenging in dynamic environments where fluctuating lighting, adverse weather, and moving objects disrupt appearance cues. Despite advances in feature representation, current absolute pose regression methods struggle to maintain consistency under varying conditions. To address this challenge, we propose a framework that synergistically combines multi-scale feature learning with semantic scene understanding. Our approach employs a hierarchical Transformer with cross-scale attention to fuse geometric details and contextual cues, preserving spatial precision while adapting to environmental changes. We improve the performance of this architecture with semantic supervision via neural scene representation during training, guiding the network to learn view-invariant features that encode persistent structural information while suppressing complex environmental…
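
The abstract does not spell out how the cross-scale attention is built, so the following is only a minimal sketch of one common form of it, not the authors' implementation. It assumes fine-scale tokens act as queries and coarse-scale tokens supply keys and values, with a residual connection so that fine spatial detail is preserved. The function name, projection matrices, token counts, and feature width are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_scale_attention(fine, coarse, w_q, w_k, w_v):
    """Single-head cross-attention from fine-scale to coarse-scale tokens.

    fine:   (n_fine, d) tokens from a high-resolution feature map
    coarse: (n_coarse, d) tokens from a low-resolution feature map
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical, untrained here)
    """
    q = fine @ w_q                       # queries come from the fine scale
    k = coarse @ w_k                     # keys come from the coarse scale
    v = coarse @ w_v                     # values come from the coarse scale
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each fine token attends over coarse tokens
    fused = fine + weights @ v           # residual keeps the fine-scale detail
    return fused, weights


# Assumed toy sizes: an 8x8 fine grid, a 4x4 coarse grid, 8-dim features.
rng = np.random.default_rng(0)
d = 8
fine = rng.standard_normal((64, d))
coarse = rng.standard_normal((16, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_scale_attention(fine, coarse, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch each fine-scale token receives a context vector that is a weighted mix of coarse-scale tokens, which is one way to combine local geometric detail with broader contextual cues. A full model would add multiple heads, normalization, and learned projections.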

Taxonomy

Topics: Robotics and Sensor-Based Localization · Robot Manipulation and Learning · Multimodal Machine Learning Applications

Methods: Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Adam · Attention Is All You Need · Softmax · Label Smoothing · Multi-Head Attention · Dropout