Does CLIP perceive art the same way we do?
Andrea Asperti, Leonardo Dessì, Maria Chiara Tonetti, Nico Wu

TL;DR
This paper investigates CLIP's ability to perceive and interpret artworks, comparing its understanding of artistic content, style, and context to human perception, revealing both strengths and limitations in its visual representations.
Contribution
It introduces targeted probing tasks to evaluate CLIP's perception of art, highlighting its capabilities and shortcomings in understanding artistic nuances and context.
Findings
CLIP effectively recognizes content and style in artworks.
CLIP shows limitations in understanding artistic intent and aesthetic nuances.
Insights inform better use of CLIP in creative and generative tasks.
Abstract
CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it 'see' the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We…
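To make the idea of a probing task concrete, here is a minimal sketch of zero-shot probing in a joint image-text embedding space. It is not the authors' protocol: the prompts, the three-dimensional toy vectors, the `zero_shot_probe` helper, and the temperature value are all illustrative assumptions, standing in for the embeddings a real CLIP image and text encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probe(image_emb, prompt_embs, temperature=0.01):
    """Score one image embedding against several text-prompt embeddings.

    Hypothetical helper: returns a softmax distribution over the prompts,
    computed from temperature-scaled cosine similarities. The temperature
    is an assumed value chosen for illustration.
    """
    labels = list(prompt_embs)
    logits = [cosine(image_emb, prompt_embs[label]) / temperature for label in labels]
    # Subtract the maximum logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy embeddings (assumptions): in practice these would come from the text
# encoder applied to each prompt and the image encoder applied to a painting.
prompt_embs = {
    "a painting in the style of Impressionism": [0.9, 0.1, 0.2],
    "a painting in the style of Cubism": [0.1, 0.9, 0.3],
    "a painting in the style of Baroque": [0.2, 0.2, 0.9],
}
image_emb = [0.8, 0.2, 0.3]

probs = zero_shot_probe(image_emb, prompt_embs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

The same pattern extends to the other dimensions named in the abstract (content, scene, historical period, presence of artifacts) by swapping the prompt set; the predicted label can then be compared with a human annotation for the same image.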
Taxonomy
Topics: Aesthetic Perception and Analysis · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
Methods: Contrastive Language-Image Pre-training
