Have Large Vision-Language Models Mastered Art History?
Ombretta Strafforello, Derya Soydaner, Michiel Willems, Anne-Sofie Maerten, Stefanie De Winter

TL;DR
This study evaluates whether large vision-language models can classify art styles, authors, and creation dates, comparing their reasoning abilities to those of human art experts through extensive analysis of multiple models and benchmarks.
Contribution
First comprehensive assessment of large VLMs' ability to interpret and classify artworks' stylistic and historical attributes, highlighting their strengths and limitations.
Findings
Models show moderate success in style classification.
Prompt sensitivity significantly affects model performance.
Models often misclassify complex or ambiguous artworks.
Abstract
The emergence of large Vision-Language Models (VLMs) has established new baselines in image classification across multiple domains. We examine whether their multimodal reasoning can also address a challenge mastered by human experts. Specifically, we test whether VLMs can classify the style, author and creation date of paintings, a domain traditionally mastered by art historians. Artworks pose a unique challenge compared to natural images due to their inherently complex and diverse structures, characterized by variable compositions and styles. This requires a contextual and stylistic interpretation rather than straightforward object recognition. Art historians have long studied the unique aspects of artworks, with style prediction being a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively reason…
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Archaeological Research and Protection · Image Processing and 3D Reconstruction
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
