From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection

Lincan Cai; Jingxuan Kang; Shuang Li; Wenxuan Ma; Binhui Xie; Zhida Qin; Jian Liang

arXiv:2505.13233 · cs.CV · May 20, 2025

TL;DR

This paper introduces an attention-guided cropping and feature selection method called ABS that enhances vision-language models' global understanding and zero-shot performance without additional training.

Contribution

The paper proposes an attention-based selection technique that improves global semantic understanding in vision-language models and reports state-of-the-art results without additional training.

Findings

01. ABS outperforms previous methods on out-of-distribution tasks.
02. ABS rivals few-shot and test-time adaptation methods.
03. The approach is training-free and effective in zero-shot settings.

Abstract

Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, such as random cropping, which, when aligned with fine-grained class descriptions generated by large language models (LLMs), significantly enhance zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can introduce background artifacts and cause models to focus excessively on local details, compromising global semantic understanding. To address these issues, we propose an Attention-Based Selection (ABS) method from local details to global context, which applies attention-guided cropping in both raw images and feature space and supplements global semantic information through strategic feature…
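To make the contrast with uniform random cropping concrete, here is a minimal sketch of attention-guided crop sampling in raw image space. It is an illustration under stated assumptions, not the authors' implementation: the function name `attention_guided_crops`, the patch-level attention input (for example, CLS-to-patch attention from a ViT image encoder), and the choice to sample crop centers in proportion to attention are all hypothetical. The feature-space step mentioned in the abstract is not sketched because the text above is truncated.

```python
import numpy as np


def attention_guided_crops(attn, image_size, crop_size, n_crops, rng=None):
    """Sample square crop boxes whose centers follow an attention map.

    attn: (gh, gw) non-negative patch-level attention scores (assumed input).
    image_size: (height, width) of the raw image in pixels.
    crop_size: side length of each square crop in pixels.
    Returns an (n_crops, 4) int array of (top, left, bottom, right) boxes.
    """
    rng = np.random.default_rng() if rng is None else rng
    attn = np.asarray(attn, dtype=float)
    gh, gw = attn.shape
    h, w = image_size
    if crop_size > min(h, w):
        raise ValueError("crop_size must fit inside the image")
    if attn.sum() <= 0:
        raise ValueError("attention map must have positive mass")

    # Treat the attention map as a probability distribution over patches.
    probs = attn.ravel() / attn.sum()
    idx = rng.choice(gh * gw, size=n_crops, p=probs)
    rows, cols = np.divmod(idx, gw)

    # Convert patch indices to the pixel coordinates of the patch centers.
    cy = (rows + 0.5) * h / gh
    cx = (cols + 0.5) * w / gw

    # Center each crop on the sampled patch, clamped to stay inside the image.
    half = crop_size / 2
    top = np.clip(cy - half, 0, h - crop_size).astype(int)
    left = np.clip(cx - half, 0, w - crop_size).astype(int)
    return np.stack([top, left, top + crop_size, left + crop_size], axis=1)


# Toy example: all attention mass sits on the central patch of a 7x7 grid,
# so every sampled crop is centered on the middle of a 224x224 image.
attn = np.zeros((7, 7))
attn[3, 3] = 1.0
boxes = attention_guided_crops(attn, (224, 224), 96, 4, np.random.default_rng(0))
print(boxes)  # each row is [64, 64, 160, 160]
```

With a realistic attention map the sampled boxes would cluster around salient regions instead of landing on background, which is the failure mode of random cropping that the abstract describes.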

Code & Models

Repositories

bit-da/abs (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Methods: Focus · Contrastive Language-Image Pre-training