Locality Alignment Improves Vision-Language Models
Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto

TL;DR
This paper introduces locality alignment, a post-training method for vision transformers that enhances their ability to encode local and global image semantics, thereby improving spatial reasoning in vision-language models.
Contribution
The authors propose a novel locality alignment technique and a MaskEmbed fine-tuning procedure that extract local semantic knowledge from pre-trained vision transformers without additional supervision.
Findings
Locality alignment improves patch-level semantic segmentation performance.
Models with locality alignment perform better on spatial understanding benchmarks.
The method enhances existing vision-language training pipelines using off-the-shelf backbones.
Abstract
Vision language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors. We hypothesize that this is due to VLMs adopting pre-trained vision backbones, specifically vision transformers (ViTs) trained with image-level supervision and minimal inductive biases. Such models may fail to encode the class contents at each position in the image, and our goal is to resolve this with a vision backbone that effectively captures both local and global image semantics. Our main insight is that we do not require new supervision to learn this capability - pre-trained models contain significant knowledge of local semantics that we can extract and use for scalable self-supervision. We propose a new efficient post-training stage for ViTs called locality alignment and a novel fine-tuning procedure called MaskEmbed that uses a masked…
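The abstract is cut off before it states MaskEmbed's objective, so the following is only a toy illustration of the general idea of masked self-supervision from a frozen pre-trained model. It is not the authors' procedure. Every name here (`toy_teacher`, `toy_student_loss`, `sample_mask`), the mean-pooling "model", and the squared-error loss are assumptions made for illustration: a frozen teacher sees a masked input, the student's per-patch embeddings are masked the same way, and the student is penalized when its pooled output disagrees with the teacher's.

```python
import random


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def apply_mask(vectors, mask):
    """Zero out every vector whose mask entry is 0 (1 = keep, 0 = drop)."""
    return [[x * m for x in v] for v, m in zip(vectors, mask)]


def sample_mask(num_patches, keep_prob, rng):
    """Draw an independent keep/drop decision for each patch."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(num_patches)]


def toy_teacher(patches, mask):
    """Hypothetical frozen teacher: pools the masked input patches."""
    return mean_pool(apply_mask(patches, mask))


def toy_student_loss(patch_embeddings, patches, mask):
    """Mean squared error between the pooled, masked student embeddings
    and the teacher's output on the identically masked input."""
    target = toy_teacher(patches, mask)
    prediction = mean_pool(apply_mask(patch_embeddings, mask))
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


rng = random.Random(0)
patches = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(9)]
mask = sample_mask(len(patches), keep_prob=0.5, rng=rng)

# A student whose patch embeddings already match the teacher's view has zero loss.
loss_matching = toy_student_loss(patches, patches, mask)
# A student that emits all-zero embeddings generally does not.
loss_zeros = toy_student_loss([[0.0] * 4 for _ in patches], patches, mask)
print(loss_matching, loss_zeros)
```

In this toy setup the loss can only reach zero for every mask if each patch embedding carries the information that its own patch contributes to the teacher's output, which is the intuition behind using masking to extract local semantics.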
Peer Reviews
Decision·ICLR 2025 Poster
1. The paper is overall well-written. 2. Locality alignment is efficient, requiring minimal additional computation compared to pre-training, making it a cost-effective solution. 3. The authors provide theoretical analysis and practical experiments to support their claims.
1. I suggest that the authors conduct a thorough analysis of MaskEmbed's sensitivity to hyperparameters. This includes varying mask sizes, patch sampling strategies, and the influence of different reconstruction targets. By understanding these sensitivities, the paper can provide guidelines for applying MaskEmbed effectively across various scenarios. Besides, including additional evaluations that specifically test the impact of MaskEmbed on global semantic understanding tasks may help to validate
The proposed method is relatively simple and provides a notable performance improvement.
1. **Issues with the Main Claim**: The paper’s primary claim is confusing. It begins by hypothesizing that VLMs perform poorly on region-level tasks due to image-level supervision and minimal inductive biases. However, pre-training methods like DINO, despite using image-level supervision, exhibit strong locality, to the point where their features can even be directly used for semantic segmentation maps. This suggests that the initial assumption may be flawed. Additionally, regarding the cla
1. The paper is well-written and easy to follow. 2. The proposed MaskEmbed training diagram is effective in learning local semantics.
1. The authors need to include comparisons and discussions with more methods, such as dBOT[1] and UMG-CLIP[2]. a) dBOT[1] employs a distillation strategy similar to the method presented in the paper, so it is necessary to discuss the differences with this method and provide performance comparisons. b) UMG-CLIP[2] directly incorporates fine-grained annotations to enhance CLIP's Locality Alignment. I am curious whether the proposed method has advantages over this method in some fundamental visua
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
