Robust Visual Localization via Semantic-Guided Multi-Scale Transformer
Zhongtao Tian, Wenhao Huang, Zhidong Chen, Xiao Wei Sun

TL;DR
This paper introduces a semantic-guided multi-scale Transformer framework that improves visual localization accuracy in dynamic environments by combining hierarchical feature learning with semantic scene understanding.
Contribution
It integrates a multi-scale Transformer architecture with semantic supervision to make visual localization more robust under challenging conditions.
Findings
Outperforms existing pose regression methods in dynamic scenarios
Effective in handling lighting changes, occlusions, and moving objects
Demonstrates robustness on the TartanAir dataset
Abstract
Visual localization remains challenging in dynamic environments where fluctuating lighting, adverse weather, and moving objects disrupt appearance cues. Despite advances in feature representation, current absolute pose regression methods struggle to maintain consistency under varying conditions. To address this challenge, we propose a framework that synergistically combines multi-scale feature learning with semantic scene understanding. Our approach employs a hierarchical Transformer with cross-scale attention to fuse geometric details and contextual cues, preserving spatial precision while adapting to environmental changes. We improve the performance of this architecture with semantic supervision via neural scene representation during training, guiding the network to learn view-invariant features that encode persistent structural information while suppressing complex environmental…
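The cross-scale attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, no learned projections, and a plain residual connection, and the names `cross_scale_attention`, `fine_tokens`, and `coarse_tokens` are hypothetical. Fine-scale tokens act as queries and attend over coarse-scale tokens, so each high-resolution feature picks up contextual cues while keeping its own spatial detail.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_scale_attention(fine, coarse):
    """Fuse coarse-scale context into fine-scale tokens (illustrative only).

    fine:   list of N_f token vectors (queries), each of length d
    coarse: list of N_c token vectors (keys and values), each of length d
    Returns N_f fused vectors of length d, residual connection included.
    Learned query/key/value projections are omitted to keep the sketch short.
    """
    d = len(fine[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in fine:
        # Scaled dot-product score of this fine token against every coarse token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in coarse]
        weights = softmax(scores)
        # Weighted sum of coarse tokens gives the contextual summary.
        context = [sum(w * v[j] for w, v in zip(weights, coarse)) for j in range(d)]
        # Residual add keeps the fine-scale detail alongside the context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


random.seed(0)
# Assumed toy sizes: a 4x4 fine feature map and a 2x2 coarse map, 8 channels each.
fine_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
coarse_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
fused_tokens = cross_scale_attention(fine_tokens, coarse_tokens)
print(len(fused_tokens), len(fused_tokens[0]))  # 16 8
```

The output keeps the fine-scale token count, which is one plausible way to read "preserving spatial precision": coarse features only contribute context and never replace the high-resolution grid.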
Taxonomy
Topics: Robotics and Sensor-Based Localization · Robot Manipulation and Learning · Multimodal Machine Learning Applications
Methods: Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Adam · Attention Is All You Need · Softmax · Label Smoothing · Multi-Head Attention · Dropout
