Activating Self-Attention for Multi-Scene Absolute Pose Regression
Miso Lee, Jihwan Kim, and Jae-Pil Heo

TL;DR
This paper identifies the issue of collapsed self-attention in transformer-based multi-scene pose regression models and proposes solutions to activate self-attention, leading to improved camera pose estimation accuracy.
Contribution
The work reveals the query-key embedding space distortion problem and introduces an auxiliary loss and fixed positional encoding to enhance self-attention activation in pose regression.
Findings
Outperforms existing methods in outdoor scenes
Outperforms existing methods in indoor scenes
Effectively activates self-attention in transformer models
Abstract
Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation across various real-world environments. Recently, a transformer-based model has been devised to regress the camera pose directly across multiple scenes. Despite their potential, the transformer encoders are underutilized because the self-attention map collapses, leaving it with low representation capacity. This work highlights the problem and investigates it from a new perspective: distortion of the query-key embedding space. Based on statistical analysis, we reveal that queries and keys are mapped to completely different spaces, while only a few keys are blended into the query region. This leads to the collapse of the self-attention map, as all queries are considered similar to those few keys. Therefore, we propose simple but effective solutions to activate self-attention. Concretely, we present an…
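The collapse described in the abstract can be illustrated with a toy example. The sketch below is not the authors' analysis or method. It assumes scaled dot-product attention and builds synthetic embeddings in which queries and keys sit in opposite regions of the space, except for a few keys placed inside the query region. All names (`make_toy_embeddings`, `mass_on_first_keys`, `offset`, `n_blended`) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_map(queries, keys):
    # Scaled dot-product attention weights, shape (n_queries, n_keys).
    dim = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(dim), axis=-1)


def make_toy_embeddings(n_tokens=64, dim=16, n_blended=2, offset=6.0, seed=0):
    # Hypothetical setup: queries cluster around +offset along one direction,
    # keys around -offset, and the first n_blended keys are moved into the
    # query region. offset=0.0 gives an undistorted baseline.
    rng = np.random.default_rng(seed)
    direction = np.ones(dim) / np.sqrt(dim)
    queries = rng.normal(size=(n_tokens, dim)) + offset * direction
    keys = rng.normal(size=(n_tokens, dim)) - offset * direction
    keys[:n_blended] = rng.normal(size=(n_blended, dim)) + offset * direction
    return queries, keys


def mass_on_first_keys(attn, n_keys):
    # Average attention mass that each query assigns to the first n_keys keys.
    return float(attn[:, :n_keys].sum(axis=-1).mean())


def mean_row_entropy(attn):
    # Average entropy of each query's attention distribution (low = collapsed).
    return float(-(attn * np.log(attn + 1e-12)).sum(axis=-1).mean())


if __name__ == "__main__":
    for name, offset in [("distorted", 6.0), ("baseline", 0.0)]:
        q, k = make_toy_embeddings(offset=offset)
        attn = attention_map(q, k)
        print(name,
              "mass on blended keys:", round(mass_on_first_keys(attn, 2), 3),
              "mean entropy:", round(mean_row_entropy(attn), 3))
```

In the distorted setting, nearly every query puts most of its attention on the same few blended keys, so the rows of the attention map look alike and carry little information. In the baseline setting, attention is spread across keys and the row entropy is much higher.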
Taxonomy
Topics: Advanced Vision and Imaging · Robot Manipulation and Learning · Human Pose and Action Recognition
