On the Explainability of Vision-Language Models in Art History

Stefanie Schneider

arXiv:2602.20853 · cs.CV · February 25, 2026

TL;DR

This paper investigates how explainable AI (XAI) methods can make the visual reasoning of CLIP, a vision-language model, legible in art-historical contexts, and where these methods succeed or fall short in capturing human interpretation.

Contribution

It evaluates seven XAI methods on CLIP in art-historical contexts and shows that their effectiveness depends on the conceptual stability and representational availability of the examined categories.

Findings

01. XAI methods partially capture human interpretation

02. Effectiveness depends on category stability

03. Methods vary in interpretability and localization accuracy

Abstract

Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine 'understanding.' In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM, namely CLIP, legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with human interpretability studies. Our results indicate that, while these methods capture some aspects of human interpretation, their effectiveness hinges on the conceptual stability and representational availability of the examined categories.
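
The idea of a shared embedding space can be illustrated with a minimal sketch. It is not the paper's pipeline and it calls no real CLIP encoder: the three-dimensional vectors, the text prompts, the helper names, and the temperature value are all hypothetical stand-ins chosen for illustration. It only shows how an image embedding might be scored against text embeddings by cosine similarity, which is the general mechanism behind zero-shot matching in models of this kind.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_scores(image_embedding, text_embeddings, temperature=0.01):
    """Score one image embedding against several labelled text embeddings.

    The temperature is an assumed value for this sketch, not a setting
    taken from the paper.
    """
    labels = list(text_embeddings)
    sims = [cosine_similarity(image_embedding, text_embeddings[label])
            for label in labels]
    return dict(zip(labels, softmax(sims, temperature)))


# Hypothetical toy embeddings; a real model would produce these with
# its image and text encoders, in far more dimensions.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a painting of an angel": [0.8, 0.2, 0.1],
    "a painting of a skull": [0.1, 0.9, 0.3],
}

scores = zero_shot_scores(image_embedding, text_embeddings)
best_label = max(scores, key=scores.get)
print(best_label)
```

In this toy setup the image vector points in nearly the same direction as the first prompt's vector, so that prompt receives almost all of the probability mass. Explanation methods of the kind the paper evaluates ask which image regions drive such a score.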

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis