ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Yuhang Yang; Jinhong Deng; Wen Li; Lixin Duan

arXiv:2411.15851·cs.CV·November 26, 2024

TL;DR

ResCLIP introduces residual cross-correlation attention and semantic feedback modules to enhance dense vision-language inference using CLIP, enabling better spatial localization and region focus without additional training.

Contribution

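The paper proposes two training-free modules: Residual Cross-correlation Self-attention (RCS), which reuses cross-correlation attention from CLIP's intermediate layers, and Semantic Feedback Refinement (SFR), which uses semantic maps. Together they improve CLIP's dense predictions without retraining.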

Findings

01 Outperforms existing training-free dense vision-language methods

02 Effectively reorganizes spatial information in CLIP

03 Boosts performance on multiple benchmarks

Abstract

While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute this deficiency to the self-attention layers in the final block, and have achieved commendable results by replacing the original query-key attention with self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the properties of cross-correlation (query-key) attention, which captures rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from…
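To make the terminology in the abstract concrete, the following minimal NumPy sketch contrasts cross-correlation (query-key) attention with self-correlation (query-query and key-key) attention on random toy tokens. The final blending step is only an illustration of the residual idea: the abstract is truncated before it specifies how RCS combines the maps, so the averaging over layers, the weight `lam`, and all variable names here are assumptions rather than the paper's formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(a, b):
    # Scaled dot-product attention map: softmax(a @ b.T / sqrt(d)).
    d = a.shape[-1]
    return softmax(a @ b.T / np.sqrt(d), axis=-1)

rng = np.random.default_rng(0)
n_tokens, dim = 6, 8  # toy sizes, not CLIP's real dimensions

# Toy query, key, and value projections for the final block.
q = rng.standard_normal((n_tokens, dim))
k = rng.standard_normal((n_tokens, dim))
v = rng.standard_normal((n_tokens, dim))

# Cross-correlation attention: the standard query-key map.
cross_attn = attention_weights(q, k)

# Self-correlation attention: query-query and key-key maps, averaged here.
self_attn = 0.5 * (attention_weights(q, q) + attention_weights(k, k))

# ASSUMPTION: stand-ins for query-key maps taken from non-final layers.
intermediate_cross = [
    attention_weights(rng.standard_normal((n_tokens, dim)),
                      rng.standard_normal((n_tokens, dim)))
    for _ in range(3)
]
residual = np.mean(intermediate_cross, axis=0)

# ASSUMPTION: a simple convex blend; `lam` is a hypothetical weight.
lam = 0.5
blended = (1.0 - lam) * self_attn + lam * residual

# Apply the blended attention map to the values.
out = blended @ v
```

Because every map is row-normalized and the blend is a convex combination, each row of `blended` still sums to one, so it can be used wherever an ordinary attention map would be.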

Code & Models

Repositories

yvhangyang/resclip
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Digital Imaging for Blood Diseases

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Focus