Caption-Driven Explainability: Probing CNNs for Bias via CLIP

Patrick Koller (Northwestern University, Evanston, Illinois, United States); Amil V. Dravid (University of California, Berkeley, California, United States); Guido M. Schuster (Eastern Switzerland University of Applied Sciences, Rapperswil, St. Gallen, Switzerland); and Aggelos K. Katsaggelos (Northwestern University, Evanston, Illinois, United States)

arXiv:2510.22035·cs.CV·February 26, 2026

TL;DR

This paper introduces a caption-based explainability method for CNNs that leverages CLIP to identify dominant concepts influencing model predictions, enhancing robustness and interpretability in computer vision.

Contribution

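The paper proposes a novel network surgery approach that integrates the model to be explained into CLIP to produce caption-driven explanations. This addresses a limitation of saliency maps, which can mislead when spurious and salient features occupy overlapping pixel regions.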

Findings

01. Identifies dominant concepts influencing CNN predictions.

02. Reduces misleading saliency from spurious features.

03. Improves model robustness through concept-based explanations.

Abstract

Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the model's prediction. This explanation…
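To make the idea of picking a "dominant concept" from captions more concrete, here is a minimal sketch of CLIP-style caption scoring: cosine similarity between one image embedding and several caption embeddings, followed by a softmax. It does not reproduce the paper's network surgery, which the abstract does not detail. The function name `dominant_concept`, the toy three-dimensional embeddings, the example captions, and the `logit_scale` value are all assumptions made for illustration; in practice the embeddings would come from the image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dominant_concept(image_embedding, caption_embeddings, logit_scale=100.0):
    """Return the best-matching caption and the score of every caption.

    caption_embeddings maps caption text to its embedding vector.
    logit_scale is an assumed temperature for this sketch.
    """
    captions = list(caption_embeddings)
    logits = [
        logit_scale * cosine(image_embedding, caption_embeddings[c])
        for c in captions
    ]
    scores = dict(zip(captions, softmax(logits)))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy embeddings standing in for encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a red object": [1.0, 0.0, 0.1],
    "a photo of a round object": [0.1, 1.0, 0.0],
}

best, scores = dominant_concept(image_embedding, caption_embeddings)
print(best)
```

In this toy setup each caption names one candidate concept, so the caption with the highest score is read as the concept that best matches the image representation.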
