On the Explainability of Vision-Language Models in Art History
Stefanie Schneider

TL;DR
This paper investigates how explainable AI (XAI) methods can make the visual reasoning of CLIP, a vision-language model, legible in art-historical contexts, and where those methods match or fall short of human interpretation.
Contribution
The paper evaluates seven XAI methods applied to CLIP in art-historical contexts and shows that their effectiveness depends on the conceptual stability and representational availability of the categories examined.
Findings
XAI methods partially capture human interpretation
Effectiveness depends on category stability
Methods vary in interpretability and localization accuracy
Abstract
Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine 'understanding.' In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with human interpretability studies. Our results indicate that, while these methods capture some aspects of human interpretation, their effectiveness hinges on the conceptual stability and representational availability of the examined categories.
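The shared embedding space mentioned in the abstract can be illustrated with a minimal sketch of zero-shot scoring: an image embedding is compared to several text-prompt embeddings by cosine similarity, and the similarities are turned into probabilities with a softmax. This is not the paper's pipeline. The three-dimensional vectors, the prompt strings, and the scale value of 100.0 are all illustrative assumptions, and a real setup would obtain the embeddings from CLIP's image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_scores(image_emb, text_embs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several prompts.

    `scale` is an assumed temperature-like factor, not a value taken from the paper.
    """
    logits = {label: scale * cosine(image_emb, emb) for label, emb in text_embs.items()}
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# Toy embeddings (made up for illustration; real ones come from the model's encoders).
image_emb = [0.9, 0.1, 0.0]
text_embs = {
    "a painting of an angel": [1.0, 0.0, 0.0],
    "a painting of a landscape": [0.0, 1.0, 0.0],
}

probs = zero_shot_scores(image_emb, text_embs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Localization methods of the kind the abstract evaluates ask a further question that this sketch does not answer: which image regions drive a given similarity score.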
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
