TL;DR
A-MAR introduces an agent-based framework for multimodal art retrieval that explicitly plans reasoning steps, improving interpretability and evidence grounding in artwork understanding.
Contribution
It presents a structured reasoning-plan approach to multimodal art retrieval, in which retrieval is conditioned on an explicit plan, improving explainability and multi-step reasoning.
Findings
A-MAR outperforms static retrieval and baseline models in explanation quality.
On the ArtCoT-QA benchmark, it shows improved evidence grounding and multi-step reasoning.
Code and data are publicly available at the provided GitHub link.
Abstract
Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step…
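The plan-then-retrieve idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `PlanStep` class, the hand-written plan, the toy corpus, and the word-overlap scorer are all assumptions made for illustration. In A-MAR the plan would be produced by the agent from the artwork and the query, and retrieval would presumably use a real multimodal retriever rather than token overlap.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PlanStep:
    """One step of a reasoning plan (hypothetical structure, not from the paper)."""
    goal: str             # what this step should establish
    evidence_needed: str  # what kind of evidence the step requires


def tokenize(text: str) -> set[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w}


def retrieve_for_step(step: PlanStep, corpus: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by word overlap with the step's goal and evidence needs.

    Word overlap is a stand-in scorer chosen only to keep the sketch runnable.
    """
    query = tokenize(step.goal) | tokenize(step.evidence_needed)
    ranked = sorted(corpus, key=lambda doc: len(query & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def plan_conditioned_retrieval(
    plan: list[PlanStep], corpus: list[str], top_k: int = 1
) -> list[tuple[PlanStep, list[str]]]:
    """Retrieve evidence separately for each plan step, keeping the pairing."""
    return [(step, retrieve_for_step(step, corpus, top_k)) for step in plan]


# Toy evidence snippets (illustrative placeholders, not benchmark data).
CORPUS = [
    "visual: swirling night sky above a small village",
    "history: painted in 1889 during the artist's stay at an asylum",
    "style: thick expressive brushstrokes typical of post-impressionism",
]

# Hand-written plan; in A-MAR this would come from the planning stage.
PLAN = [
    PlanStep(goal="describe the visual content", evidence_needed="visual sky village"),
    PlanStep(goal="place the work in history", evidence_needed="history painted 1889"),
    PlanStep(goal="identify the style", evidence_needed="style brushstrokes"),
]

if __name__ == "__main__":
    for step, evidence in plan_conditioned_retrieval(PLAN, CORPUS):
        print(f"{step.goal} -> {evidence[0]}")
```

Because each step carries its own evidence requirement, the retrieved snippet stays attached to the step that asked for it, which is what makes a step-wise, grounded explanation possible. A single static query over the whole question would return one ranked list with no such pairing.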
