Extract Free Dense Misalignment from CLIP

JeongYeon Nam; Jinbae Im; Wonjae Kim; and Taeho Kil

arXiv:2412.18404·cs.CV·December 25, 2024

TL;DR

This paper introduces CLIP4DM, a zero-shot method that uses pre-trained CLIP to detect dense misalignments, pinpointing the words in a text that do not match the image, without fine-tuning on human annotations.

Contribution

The paper revamps gradient-based attribution for CLIP to identify misaligned words and proposes F-CLIPScore for assessing global misalignment, achieving state-of-the-art zero-shot results.

Findings

01. State-of-the-art zero-shot dense misalignment detection
02. Competitive performance with fine-tuned models
03. Effective detection of entity-level and attribute misalignments

Abstract

Recent vision-language foundation models still frequently produce outputs misaligned with their inputs, as evidenced by object hallucination in captioning and prompt misalignment in text-to-image generation models. Recent studies have explored methods for identifying misaligned elements, aiming not only to enhance interpretability but also to improve model performance. However, current approaches rely primarily on large foundation models used in a zero-shot manner or on models fine-tuned with human annotations, which limits scalability because of the significant computational cost. This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments with pre-trained CLIP, focusing specifically on pinpointing misaligned words between image and text. We carefully revamp the gradient-based attribution computation method, enabling the negative gradient of individual text tokens to indicate…
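To make the idea of gradient-based token attribution concrete, here is a minimal toy sketch. It is not the CLIP4DM implementation: it assumes a hypothetical text encoder that simply mean-pools token embeddings, uses hand-picked 3-dimensional embeddings, and applies plain gradient-times-input attribution to the cosine similarity between text and image. The function names `token_attributions` and `flag_misaligned` are illustrative only. Under these assumptions, tokens with negative attribution are the ones pulling the image-text similarity down.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def token_attributions(token_embs, image_emb):
    """Gradient-times-input attribution of cosine similarity, per token.

    Assumption: the text embedding u is the mean of the token embeddings,
    a stand-in for a real text encoder such as CLIP's.
    """
    n = len(token_embs)
    dim = len(image_emb)
    u = [sum(t[d] for t in token_embs) / n for d in range(dim)]
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(image_emb, image_emb))
    sim = dot(u, image_emb) / (norm_u * norm_v)
    # Analytic gradient of cos(u, v) with respect to u.
    grad_u = [
        image_emb[d] / (norm_u * norm_v) - sim * u[d] / (norm_u ** 2)
        for d in range(dim)
    ]
    # Mean pooling gives d u / d t_i = 1 / n, so each token's
    # gradient-times-input score is (t_i . grad_u) / n.
    attrs = [dot(t, grad_u) / n for t in token_embs]
    return attrs, sim


def flag_misaligned(words, token_embs, image_emb, threshold=0.0):
    """Return the words whose attribution falls below the threshold."""
    attrs, _ = token_attributions(token_embs, image_emb)
    return [w for w, a in zip(words, attrs) if a < threshold]


# Hand-picked toy embeddings (assumption): the image "contains" a dog only.
image_emb = [1.0, 0.0, 0.0]
words = ["a", "dog", "frisbee"]
token_embs = [
    [0.2, 0.1, 0.0],   # "a": roughly neutral
    [1.0, 0.0, 0.0],   # "dog": aligned with the image
    [-0.5, 1.0, 0.0],  # "frisbee": points away from the image
]

attrs, sim = token_attributions(token_embs, image_emb)
flagged = flag_misaligned(words, token_embs, image_emb)
print(flagged)  # ['frisbee']
```

Because cosine similarity is invariant to the scale of the text embedding, the attributions in this toy setup sum to zero, so a threshold of zero separates tokens that raise the similarity from tokens that lower it. A real CLIP text encoder is not a mean pool, so this property and the threshold choice should not be assumed to carry over.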

Code & Models

Repositories

naver-ai/clip4dm (PyTorch, official)

Videos

Extract Free Dense Misalignment from CLIP (Underline)

Taxonomy

Topics: Photonic and Optical Devices

Methods: Contrastive Language-Image Pre-training