Locality Alignment Improves Vision-Language Models

Ian Covert; Tony Sun; James Zou; Tatsunori Hashimoto

arXiv:2410.11087·cs.CV·March 5, 2025

TL;DR

This paper introduces locality alignment, a post-training method for vision transformers that enhances their ability to encode local and global image semantics, thereby improving spatial reasoning in vision-language models.

Contribution

The authors propose a novel locality alignment technique and a MaskEmbed fine-tuning procedure that extract local semantic knowledge from pre-trained vision transformers without additional supervision.

Findings

01. Locality alignment improves patch-level semantic segmentation performance.

02. Models with locality alignment perform better on spatial understanding benchmarks.

03. The method enhances existing vision-language training pipelines using off-the-shelf backbones.

Abstract

Vision language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors. We hypothesize that this is due to VLMs adopting pre-trained vision backbones, specifically vision transformers (ViTs) trained with image-level supervision and minimal inductive biases. Such models may fail to encode the class contents at each position in the image, and our goal is to resolve this with a vision backbone that effectively captures both local and global image semantics. Our main insight is that we do not require new supervision to learn this capability: pre-trained models contain significant knowledge of local semantics that we can extract and use for scalable self-supervision. We propose a new efficient post-training stage for ViTs called locality alignment and a novel fine-tuning procedure called MaskEmbed that uses a masked…
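The abstract is cut off before it describes MaskEmbed in full, so the following is only a minimal sketch of one plausible masked reconstruction objective, not the authors' implementation. It assumes a frozen teacher that sees a masked image, a trainable student whose patch embeddings are masked, and a small decoder that predicts the teacher's output. All function names, shapes, and the toy linear models are hypothetical, and NumPy stands in for a real ViT.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher(patches, W_t):
    # Hypothetical frozen pre-trained model: image-level output from input patches.
    return np.tanh(patches @ W_t).mean(axis=0)

def student(patches, W_s):
    # Hypothetical trainable encoder: one embedding per patch.
    return patches @ W_s

def decoder(masked_embeddings, W_d):
    # Hypothetical lightweight head: predicts the teacher output from masked embeddings.
    return masked_embeddings.mean(axis=0) @ W_d

def masked_reconstruction_loss(patches, mask, W_t, W_s, W_d):
    # mask[i] is True when patch i is kept, False when it is zeroed out.
    m = mask[:, None].astype(patches.dtype)
    target = teacher(patches * m, W_t)              # teacher sees the masked image
    pred = decoder(student(patches, W_s) * m, W_d)  # student embeddings are masked instead
    return float(np.mean((pred - target) ** 2))     # mean squared error

# Toy sizes, chosen only for illustration.
num_patches, patch_dim, embed_dim, out_dim = 16, 12, 8, 4
patches = rng.normal(size=(num_patches, patch_dim))
W_t = rng.normal(size=(patch_dim, out_dim))
W_s = rng.normal(size=(patch_dim, embed_dim))
W_d = rng.normal(size=(embed_dim, out_dim))

mask = rng.random(num_patches) < 0.5
loss = masked_reconstruction_loss(patches, mask, W_t, W_s, W_d)
```

Under these assumptions, the loss can only be driven down if each patch embedding carries the information the teacher would extract from that patch alone, which is one way to read the abstract's claim that local semantics can be extracted without new supervision.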

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 5·Confidence 3

Strengths

1. The paper is overall well-written.
2. Locality alignment is efficient, requiring minimal additional computation compared to pre-training, making it a cost-effective solution.
3. The authors provide theoretical analysis and practical experiments to support their claims.

Weaknesses

1. I suggest that the authors conduct a thorough analysis of MaskEmbed's sensitivity to hyperparameters, including varying mask sizes, patch sampling strategies, and the influence of different reconstruction targets. Understanding these sensitivities would let the paper provide guidelines for applying MaskEmbed effectively across various scenarios. In addition, including evaluations that specifically test the impact of MaskEmbed on global semantic understanding tasks may help to validate

Reviewer 02·Rating 6·Confidence 4

Strengths

The proposed method is relatively simple and provides a notable performance improvement.

Weaknesses

1. **Issues with the Main Claim**: The paper’s primary claim is confusing. It begins by hypothesizing that VLMs perform poorly on region-level tasks due to image-level supervision and minimal inductive biases. However, pre-training methods like DINO, despite using image-level supervision, exhibit strong locality, to the point where their features can even be directly used for semantic segmentation maps. This suggests that the initial assumption may be flawed. Additionally, regarding the cla

Reviewer 03·Rating 5·Confidence 4

Strengths

1. The paper is well-written and easy to follow.
2. The proposed MaskEmbed training diagram is effective in learning local semantics.

Weaknesses

1. The authors need to include comparisons and discussions with more methods, such as dBOT[1] and UMG-CLIP[2].
a) dBOT[1] employs a distillation strategy similar to the method presented in the paper, so it is necessary to discuss the differences from this method and provide performance comparisons.
b) UMG-CLIP[2] directly incorporates fine-grained annotations to enhance CLIP's locality alignment. I am curious whether the proposed method has advantages over this method in some fundamental visua

Code & Models

Repositories

shengliu66/vti
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training