Locality-Attending Vision Transformer
Sina Hajimiri, Farzad Beizaee, Fereshteh Shakeri, Christian Desrosiers, Ismail Ben Ayed, Jose Dolz

TL;DR
This paper introduces a simple add-on for vision transformers that biases self-attention toward local neighborhoods, significantly improving segmentation performance while maintaining classification accuracy.
Contribution
It proposes a learnable Gaussian kernel to modulate self-attention and refine patch embeddings, enhancing local detail capture in vision transformers for segmentation.
Findings
Over 6% segmentation gain on ADE20K with ViT Tiny
Over 4% segmentation gain on ADE20K with ViT Base
Retains image classification performance
Abstract
Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful…
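The Gaussian modulation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the additive form of the bias, the scale `alpha`, the way `sigma` is supplied (a scalar or one value per query patch), and all function names are assumptions made for illustration, and the [CLS] token is left out for simplicity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def patch_coords(h, w):
    """(row, col) coordinates of an h x w patch grid, shape (h*w, 2)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)


def gaussian_bias(h, w, sigma):
    """Gaussian kernel over pairwise patch distances, shape (N, N).

    sigma may be a scalar or an array of shape (N,), one width per query
    patch. In a trained model this would be learned; here it is an input.
    """
    coords = patch_coords(h, w)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    sigma = np.asarray(sigma, dtype=np.float64).reshape(-1, 1)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def local_attention(q, k, v, h, w, sigma, alpha=1.0):
    """Single-head attention with a Gaussian locality bias on the logits.

    Assumption: the bias is added to the scaled dot-product logits with a
    hypothetical strength alpha, so distant patches stay reachable.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits = logits + alpha * gaussian_bias(h, w, sigma)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights
```

Because the bias is added before the softmax rather than applied as a hard mask, every patch keeps a nonzero weight on every other patch, which matches the abstract's framing of a soft preference for neighbors rather than a restricted window.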
Peer Reviews
Decision: ICLR 2026 Poster
+ Conceptually simple but effective modification. Specifically, the paper computes a Gaussian kernel based on the query token and adds local spatial information to the attention logits, thereby balancing global and local attention, which benefits segmentation tasks. In addition, to mitigate the limitations of the [CLS] token, the work applies a parameter-free "self-attention"-style refinement to the output of the final layer. Though the solution is simple and straightfor
- The paper lacks visualization of the method's effects. For example, case studies could show attention heatmaps after applying Gaussian-Augmented attention and Patch Representation Refinement.
- It is recommended to highlight the motivation for using the Gaussian kernel before the Method section. For example, by listing a table that qualitatively compares convolution-based hybrids, locality mechanisms inside attention, positional encodings, etc., and explicitly points out the advantages of using
1. LocAtViT introduces locality awareness through the GAug and PRR modules with minimal architectural modification. 2. The proposed modules yield substantial improvements on segmentation benchmarks while preserving or even slightly improving ImageNet classification accuracy. 3. The work highlights a valuable perspective that ViT pretraining can be enhanced for dense prediction by refining patch-level representations.
1. Although the authors point out that the global attention mechanism of ViT is not conducive to capturing local details, there is no analysis of how local features degrade in the baseline ViT. 2. The paper proposes Gaussian-Augmented attention directly, but does not explain why a Gaussian kernel was chosen (instead of other forms of local attenuation functions) or how it corresponds to human vision or signal attenuation models. The lack of theoretical or empirical support mak
1) Simple, plug-in design with low overhead: Aligns with trends showing locality helps ViTs; uses a soft bias so global context remains available. 2) Clear target and protocol: Frozen-backbone segmentation fairly isolates representational gains; positive results in self-supervised DINO suggest generality beyond supervised pretraining. 3) Potential impact: Cost seems negligible and code is clean, thus many ViT backbones could adopt the tweak during pretraining to become more “segmentation-ready
1) Full fine-tuning and detection: Frozen-backbone segmentation is informative but not standard practice; include end-to-end segmentation fine-tuning and at least one object detection benchmark (e.g., COCO with a simple detector) to test if gains persist under typical training. 2) Ablation transparency: Report stability and learned variance scales of the Gaussian (do they collapse or saturate?), and whether the [CLS] treatment or bias masking affects results. 3) Aggregation vs. class-attention
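The parameter-free "self-attention"-style refinement that the first review mentions can be sketched as follows. The source does not give the formula, so this is only one plausible reading: final-layer patch embeddings attend to each other with no learned projections, using their own scaled dot-product similarity. The function name `refine_patches` and the `temperature` argument are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_patches(x, temperature=None):
    """Parameter-free refinement of patch embeddings x with shape (N, d).

    Assumption: queries, keys, and values are all x itself, so no weights
    are introduced. Each output row is a similarity-weighted average of
    the input rows.
    """
    d = x.shape[-1]
    t = np.sqrt(d) if temperature is None else float(temperature)
    weights = softmax(x @ x.T / t, axis=-1)
    return weights @ x


# Usage: refine 16 patch embeddings of width 8.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
refined = refine_patches(patches)
```

Since each refined embedding is a convex combination of the original ones, the operation smooths patch features toward similar patches without adding parameters. Whether the paper also uses a residual connection or a different similarity is not stated in the text above.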
Taxonomy
Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · CCD and CMOS Imaging Sensors
