ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Liyan Tang; Grace Kim; Xinyu Zhao; Thom Lake; Wenxuan Ding; Fangcong Yin; Prasann Singhal; Manya Wadhwa; Zeyu Leo Liu; Zayne Sprague; Ramya Namuduri; Bodun Hu; Juan Diego Rodriguez; Puyuan Peng; Greg Durrett

arXiv:2505.13444 · cs.CL · February 12, 2026

TL;DR

This paper introduces ChartMuseum, a challenging new benchmark for evaluating the visual reasoning capabilities of large vision-language models on real-world chart understanding tasks, exposing significant performance gaps compared to humans.

Contribution

The paper presents ChartMuseum, a novel chart question answering benchmark with expert-annotated questions designed to assess complex visual and textual reasoning in large vision-language models (LVLMs), highlighting current model limitations.

Findings

01. Models perform significantly worse than humans on the benchmark.

02. Questions that rely primarily on visual reasoning cause a 35%-55% performance drop relative to questions that rely primarily on textual reasoning.

03. Current models struggle with specific categories of visual reasoning.

Abstract

Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform…

Code & Models

Datasets

lytang/ChartMuseum
dataset · 529 downloads

Videos

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling