ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval
Nicola Fanelli, Gennaro Vessio, Giovanna Castellano

TL;DR
ArtSeek is a multimodal framework that combines image analysis, retrieval, and reasoning to understand artworks deeply, leveraging a new dataset and achieving state-of-the-art results in art classification and captioning.
Contribution
The paper introduces ArtSeek, a novel multimodal art analysis system that operates solely on images and incorporates a large-scale knowledge dataset for improved reasoning.
Findings
+8.4% F1 in style classification
+7.1 BLEU@1 in captioning
Effective interpretation of visual motifs and historical context
Abstract
Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, a situation common in most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is…
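The abstract names late interaction retrieval without showing how such scoring works. As a rough illustration only, the sketch below implements the generic MaxSim rule popularized by ColBERT-style retrievers: each query embedding is matched to its most similar document embedding, and those maxima are summed. Whether ArtSeek scores candidates exactly this way is an assumption; the function names (`maxsim_score`, `rank`) and the toy two-dimensional vectors are hypothetical and do not come from the paper.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Generic late-interaction (MaxSim) score, assumed here for illustration.

    For every query embedding, take the highest cosine similarity against
    any document embedding, then sum those maxima over the query.
    """
    if not doc_vecs:
        return 0.0
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank(query_vecs: Sequence[Vector], corpus: Sequence[Sequence[Vector]]) -> List[int]:
    """Return corpus indices ordered from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_vecs, doc) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)


# Toy data: two query embeddings and two candidate documents (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_partial = [[1.0, 0.0]]            # matches only the first query embedding
doc_full = [[2.0, 0.0], [0.0, 3.0]]   # matches both, regardless of vector length
order = rank(query, [doc_partial, doc_full])
```

In this toy case `doc_full` ranks first because it has a close match for both query embeddings, whereas `doc_partial` covers only one. Keeping one embedding per token or image patch, rather than pooling into a single vector, is what lets this kind of scoring reward fine-grained matches.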
