CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

Xuefeng Wei; Zhixuan Wang; Xuan Zhou; Zhi Qu; Hongyao Li; Yusuke Sakai; Hidetaka Kamigaito; Taro Watanabe

arXiv:2604.11632·cs.CL·April 14, 2026

TL;DR

CARTBENCH is a comprehensive benchmark for evaluating Chinese art understanding, interpretation, and authenticity reasoning in vision-language models, revealing significant gaps in current model capabilities.

Contribution

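CARTBENCH introduces four subtasks tailored to Chinese artworks (CURATORQA, CATALOGCAPTION, REINTERPRET, and CONNOISSEURPAIRS). It is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, so that each subtask probes a distinct reasoning skill in VLMs.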

Findings

01. High overall CURATORQA accuracy masks sharp drops on hard evidence linking and style-to-period inference

02. Long-form, expert-style appreciation remains far from expert references

03. Authenticity discrimination remains near chance levels

Abstract

We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented…
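The claim that a high overall accuracy can mask drops on harder question types is easy to see with a per-group breakdown. The sketch below is illustrative only: the function name `accuracy_by_group`, the group labels, and the toy records are all hypothetical and are not taken from CARTBENCH or its results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}) for (group, is_correct) pairs."""
    totals = defaultdict(int)  # number of questions seen per group
    hits = defaultdict(int)    # number of correct answers per group
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {group: hits[group] / totals[group] for group in totals}
    return overall, per_group


# Made-up toy records (NOT benchmark data): many easy items, few hard ones.
toy_records = (
    [("recognition", True)] * 18
    + [("recognition", False)] * 2
    + [("evidence_linking", True)] * 2
    + [("evidence_linking", False)] * 2
)

overall, per_group = accuracy_by_group(toy_records)
print(f"overall: {overall:.2f}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.2f}")
```

Because the larger, easier group dominates the pooled average, the aggregate stays high even when the smaller group is answered at a much lower rate, which is why reporting per-type accuracy alongside the overall number is informative.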
