ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval

Nicola Fanelli; Gennaro Vessio; Giovanna Castellano

arXiv:2507.21917·cs.CV·July 30, 2025

TL;DR

ArtSeek is a multimodal framework that combines image analysis, retrieval, and reasoning to understand artworks deeply, leveraging a new dataset and achieving state-of-the-art results in art classification and captioning.

Contribution

The paper introduces ArtSeek, a novel multimodal art analysis system that operates solely on images and incorporates a large-scale knowledge dataset for improved reasoning.

Findings

01  +8.4% F1 in style classification

02  +7.1 BLEU@1 in captioning

03  Effective interpretation of visual motifs and historical context

Abstract

Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, a situation common in most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is…
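
To make "late interaction retrieval" concrete, here is a minimal sketch of the ColBERT-style MaxSim scoring that the term usually refers to. The abstract does not describe ArtSeek's actual retriever, so this is an illustration of the general technique, not the authors' implementation. The function names (`maxsim_score`, `rank_documents`), the toy 2-dimensional embeddings, and the dictionary-based corpus are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction score: for each query token embedding, take the
    similarity to its best-matching document token, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(query_tokens: List[Vector],
                   corpus: Dict[str, List[Vector]]) -> List[str]:
    """Return document ids sorted by descending MaxSim score."""
    scores = {doc_id: maxsim_score(query_tokens, tokens)
              for doc_id, tokens in corpus.items()}
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy example (hypothetical data): two query token embeddings, two documents.
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "doc_b": [[1.0, 0.0]],              # matches only the first query token
}
ranking = rank_documents(query, corpus)
```

Unlike a single-vector retriever, this scheme keeps one embedding per token (or image patch) and defers the query-document comparison to scoring time, which is what lets fine-grained matches contribute to the score.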
