Extract Free Dense Misalignment from CLIP
JeongYeon Nam, Jinbae Im, Wonjae Kim, and Taeho Kil

TL;DR
This paper introduces CLIP4DM, a zero-shot method that uses pre-trained CLIP to pinpoint misaligned words between an image and its text, improving interpretability and performance without extensive fine-tuning.
Contribution
It revamps gradient-based attribution for CLIP to identify misaligned words and proposes F-CLIPScore for global misalignment assessment, achieving state-of-the-art zero-shot results.
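To make the idea of token-level gradient attribution concrete, here is a minimal toy sketch in pure Python. It is not the CLIP4DM implementation: the linear mean-pooled score, the two-dimensional embeddings, and the names `token_attributions` and `flag_misaligned` are all assumptions made for illustration. It only shows how a gradient-times-input value per token can be negative, and how a negative value can be read as a sign that the token pulls the image-text score down.

```python
import math

def token_attributions(token_embs, image_emb):
    """Gradient-times-input attribution of each token to a toy alignment score.

    Toy score (an assumption, standing in for CLIP's text encoder):
        score = dot(mean(token_embs), image_emb / ||image_emb||)
    """
    n = len(token_embs)
    norm = math.sqrt(sum(x * x for x in image_emb))
    image_unit = [x / norm for x in image_emb]
    # The toy score is linear in every token embedding, so the gradient
    # d(score)/d(token_k) is the same vector for all tokens: image_unit / n.
    grad = [x / n for x in image_unit]
    # Gradient-times-input: dot product of each token embedding with its gradient.
    return [sum(g * t for g, t in zip(grad, emb)) for emb in token_embs]

def flag_misaligned(tokens, attributions, threshold=0.0):
    """Return the tokens whose attribution falls below the threshold."""
    return [tok for tok, a in zip(tokens, attributions) if a < threshold]

# Hypothetical embeddings: "purple" points away from the image direction.
tokens = ["a", "dog", "purple"]
token_embs = [[0.1, 0.0], [0.9, 0.1], [-0.5, 0.4]]
image_emb = [2.0, 0.0]

attrs = token_attributions(token_embs, image_emb)
print(flag_misaligned(tokens, attrs))  # prints ['purple']
```

With a real model, the gradient would come from automatic differentiation through the text encoder rather than from a closed-form expression, and the threshold would need tuning; both are outside what this sketch covers.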
Findings
State-of-the-art zero-shot dense misalignment detection
Competitive performance with fine-tuned models
Effective detection of entity-level and attribute misalignments
Abstract
Recent vision-language foundation models still frequently produce outputs misaligned with their inputs, as evidenced by object hallucination in captioning and prompt misalignment in text-to-image generation models. Recent studies have explored methods for identifying misaligned elements, aiming not only to enhance interpretability but also to improve model performance. However, current approaches primarily rely on large foundation models in a zero-shot manner or on models fine-tuned with human annotations, which limits scalability due to significant computational costs. This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP, specifically focusing on pinpointing misaligned words between image and text. We carefully revamp the gradient-based attribution computation method, enabling the negative gradient of individual text tokens to indicate…
Taxonomy
Topics: Photonic and Optical Devices
Methods: Contrastive Language-Image Pre-training
