Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach

Aishwarya Agarwal; Srikrishna Karanam; Vineet Gandhi

arXiv:2511.12978 · cs.CV · November 18, 2025

Open Access

TL;DR

This paper introduces CCI, an interpretability method for CLIP that more faithfully identifies which concept regions the model relies on, and COVAR, a benchmark that disentangles background and foreground effects for robustness evaluation of vision-language models.

Contribution

The paper proposes CCI, a novel cluster-based interpretability method, and COVAR, a benchmark for disentangling background and foreground influences; together they support a finer-grained analysis of CLIP's robustness.

Findings

01. CCI outperforms previous interpretability methods on faithfulness benchmarks.

02. Combining CCI with GroundedSAM enables automatic categorization of prediction drivers.

03. COVAR reveals that many errors are due to viewpoint, scale, and fine-grained confusions.
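
The faithfulness benchmarks in the first finding include the deletion-AUC metric named in the abstract. As a rough illustration of how such a metric is commonly computed, the sketch below removes units (patches or clusters) in order of decreasing attributed importance, records the model score after each removal, and integrates the resulting curve. Under this usual convention a lower area indicates a more faithful attribution; whether the paper reports it in exactly this form is not stated here. The names `deletion_auc`, `score_fn`, and the toy weights are hypothetical and not taken from the paper.

```python
import numpy as np

def deletion_auc(importance, score_fn):
    """Area under the score curve as units are deleted, most important first.

    importance: 1-D array, one attribution value per unit (hypothetical input).
    score_fn:   callable taking a boolean keep-mask and returning a float score
                (stands in for re-running the model on the masked input).
    """
    order = np.argsort(-np.asarray(importance, dtype=float))
    keep = np.ones(len(order), dtype=bool)
    scores = [score_fn(keep)]
    for i in order:
        keep[i] = False                      # delete the next most important unit
        scores.append(score_fn(keep))
    scores = np.asarray(scores, dtype=float)
    # Trapezoidal rule with the x-axis (fraction deleted) normalised to [0, 1].
    return float(((scores[:-1] + scores[1:]) / 2).mean())

# Toy model (assumption): the score is the sum of the weights of the kept units.
weights = np.array([0.5, 0.3, 0.2, 0.0])

def toy_score(keep):
    return float(weights[keep].sum())

faithful_auc = deletion_auc(weights, toy_score)     # ranking matches the true weights
unfaithful_auc = deletion_auc(-weights, toy_score)  # ranking is exactly reversed
```

In this toy setting the attribution that matches the true weights makes the score collapse quickly, so its area is smaller than that of the reversed ranking.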

Abstract

Contrastive vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP's own patch embeddings to group spatial patches into semantically coherent clusters, mask them, and evaluate relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose that CCI, when combined with GroundedSAM, automatically categorizes predictions as foreground- or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and…
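
The cluster, mask, and compare procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain k-means with a deterministic farthest-point initialisation as the grouping step (the abstract does not say which clustering algorithm is used), it operates on an array of patch embeddings rather than a real CLIP model, and it abstracts "mask the cluster and re-run the model" into a hypothetical `score_fn` callback. The names `kmeans`, `cluster_importance`, and `toy_score` are all invented for this example.

```python
import numpy as np

def kmeans(x, k, iters=20):
    """Plain k-means on the rows of x with farthest-point initialisation."""
    centers = [x[0]]
    for _ in range(1, k):
        # Squared distance from every row to its nearest already-chosen center.
        d = np.min([((x - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers).astype(float)
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(0)
    return labels

def cluster_importance(patch_emb, score_fn, k=4):
    """Relative drop in score when each cluster of patches is masked out.

    patch_emb: (num_patches, dim) array of patch embeddings.
    score_fn:  callable taking a boolean keep-mask over patches and returning
               a float, e.g. image-text similarity (hypothetical stand-in).
    """
    labels = kmeans(patch_emb, k)
    base = score_fn(np.ones(len(patch_emb), dtype=bool))
    importance = {}
    for j in np.unique(labels):
        masked = score_fn(labels != j)       # drop every patch in cluster j
        importance[int(j)] = (base - masked) / (abs(base) + 1e-8)
    return labels, importance

# Toy data (assumption): four "foreground" patches aligned with the text
# embedding and four "background" patches orthogonal to it.
patches = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.9, 0.0],
                    [0.0, 1.0], [0.1, 0.9], [0.0, 0.9], [0.1, 1.0]])
text = np.array([1.0, 0.0])

def toy_score(keep):
    # Cosine similarity between the mean of the kept patches and the text vector.
    pooled = patches[keep].mean(0)
    return float(pooled @ text / (np.linalg.norm(pooled) * np.linalg.norm(text)))

labels, importance = cluster_importance(patches, toy_score, k=2)
```

On the toy data, masking the cluster aligned with the text lowers the score and so receives positive importance, while masking the orthogonal cluster raises the score and receives negative importance. With a real model, `score_fn` would mask the corresponding image regions and recompute the prediction.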

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)