CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen

TL;DR
CharXiv introduces a challenging, human-verified evaluation suite of diverse scientific charts to better measure the reasoning and understanding capabilities of multimodal large language models, revealing significant gaps compared to human performance.
Contribution
This work presents CharXiv, a new comprehensive and challenging dataset for evaluating chart understanding in multimodal LLMs, highlighting current models' weaknesses and providing a more realistic benchmark.
Findings
Proprietary GPT-4o achieves 47.1% accuracy on CharXiv.
Open-source InternVL Chat V1.5 achieves 29.2%.
Humans achieve 80.5%, showing large gaps in model performance.
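The size of those gaps follows directly from the three figures above. A minimal sketch of the arithmetic, where the dictionary layout and the helper name `gap_to_human` are illustrative choices and not part of the CharXiv codebase:

```python
# Accuracy figures quoted in the findings above (percent).
accuracy = {
    "GPT-4o": 47.1,
    "InternVL Chat V1.5": 29.2,
}
human_accuracy = 80.5


def gap_to_human(model_accuracy, human=human_accuracy):
    """Return the shortfall versus humans in percentage points (one decimal)."""
    return round(human - model_accuracy, 1)


# Hypothetical helper applied to each reported model.
gaps = {name: gap_to_human(acc) for name, acc in accuracy.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} points below human accuracy")
```

Under these numbers the strongest proprietary model trails humans by 33.4 points and the strongest open-source model by 51.3 points.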
Abstract
Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual…
Taxonomy
Topics: Natural Language Processing Techniques · Lexicography and Language Studies · Mathematics, Computing, and Information Processing
