ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
Rongtian Ye

TL;DR
ChartDiff is a large-scale benchmark designed to evaluate models' ability to perform cross-chart comparative summarization, addressing a gap in existing chart understanding benchmarks.
Contribution
We introduce ChartDiff, the first extensive benchmark for multi-chart comparison, with diverse chart pairs and annotations, to evaluate and improve chart reasoning models.
Findings
Frontier general-purpose models achieve the highest GPT-based quality scores.
Specialized and pipeline-based methods score higher on ROUGE but lower on human-aligned evaluation.
Multi-series charts remain challenging for current models.
Abstract
Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap…
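The mismatch between lexical overlap and summary quality noted in the abstract can be illustrated with a small sketch. The function below is a simplified unigram ROUGE-1 F1 (whitespace tokenization, no stemming), written for illustration only; it is not the paper's evaluation code, and the three example sentences are invented rather than taken from ChartDiff.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, invented for this sketch.
reference = "sales rise steadily in chart a but fall sharply in chart b"
# Nearly identical wording, but it attributes both trends to chart a (factually wrong).
copy_like = "sales rise steadily in chart a but fall sharply in chart a"
# Correct comparison expressed in different words.
paraphrase = "revenue climbs in the first plot while the second plot shows a steep drop"

print(round(rouge1_f1(copy_like, reference), 3))
print(round(rouge1_f1(paraphrase, reference), 3))
```

In this toy case the factually wrong near-copy scores far higher than the correct paraphrase, which is the kind of behavior that can make a lexical-overlap metric disagree with GPT-based or human-aligned judgments.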
