ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan

TL;DR
ResCLIP adds a residual cross-correlation attention module and a semantic feedback module to CLIP, improving spatial localization and region focus in dense vision-language inference without additional training.
Contribution
The paper proposes two modules that improve CLIP's dense prediction capabilities without retraining: Residual Cross-correlation Self-attention (RCS), which leverages attention from intermediate layers, and Semantic Feedback Refinement (SFR), which leverages semantic maps.
Findings
Outperforms existing training-free dense vision-language methods
Effectively reorganizes spatial information in CLIP
Boosts performance on multiple benchmarks
Abstract
While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute this deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by replacing the original query-key attention with self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the properties of cross-correlation (query-key) attention, which captures rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from…
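The attention variants contrasted in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract is truncated before it specifies how RCS combines the attention maps, so the function names, the averaging of the query-query and key-key maps, the mean over intermediate-layer maps, and the blending parameter `weight` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the per-row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_correlation_attention(q, k):
    # Standard query-key attention: softmax(Q K^T / sqrt(d)).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))


def self_correlation_attention(q, k):
    # Query-query and key-key attention, averaged so each row sums to 1.
    # Averaging the two maps is an assumption of this sketch.
    d = q.shape[-1]
    qq = softmax(q @ q.T / np.sqrt(d))
    kk = softmax(k @ k.T / np.sqrt(d))
    return 0.5 * (qq + kk)


def residual_blend(final_attn, intermediate_attns, weight=0.5):
    # Hypothetical residual combination: mix the final-block map with the
    # mean of intermediate-layer cross-correlation maps.
    residual = np.mean(intermediate_attns, axis=0)
    return (1.0 - weight) * final_attn + weight * residual


# Toy data: random queries and keys for three layers of six tokens each.
rng = np.random.default_rng(0)
num_tokens, dim = 6, 8
layers = [
    (rng.normal(size=(num_tokens, dim)), rng.normal(size=(num_tokens, dim)))
    for _ in range(3)
]

# Cross-correlation maps from the non-final layers.
intermediate = [cross_correlation_attention(q, k) for q, k in layers[:-1]]
# Self-correlation map from the final layer.
final = self_correlation_attention(*layers[-1])

blended = residual_blend(final, intermediate, weight=0.5)
print(blended.shape)
```

Because every input map is row-normalized and the blend is a convex combination, each row of `blended` remains a valid attention distribution over the tokens.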
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Digital Imaging for Blood Diseases
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Focus
