CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Zirui Wang; Mengzhou Xia; Luxi He; Howard Chen; Yitao Liu; Richard Zhu; Kaiqu Liang; Xindi Wu; Haotian Liu; Sadhika Malladi; Alexis Chevalier; Sanjeev Arora; Danqi Chen

arXiv:2406.18521 · cs.CL · June 27, 2024 · 3 cites

TL;DR

CharXiv introduces a challenging, human-verified evaluation suite of diverse scientific charts to better measure the reasoning and understanding capabilities of multimodal large language models, revealing significant gaps compared to human performance.

Contribution

This work presents CharXiv, a new comprehensive and challenging dataset for evaluating chart understanding in multimodal LLMs, highlighting current models' weaknesses and providing a more realistic benchmark.

Findings

01

Proprietary GPT-4o achieves 47.1% accuracy on CharXiv.

02

Open-source InternVL Chat V1.5 achieves 29.2%.

03

Humans achieve 80.5%, leaving a large gap between human and model performance.
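
To make the size of these gaps concrete, the arithmetic can be sketched as follows. This is a minimal illustration, not part of the CharXiv codebase: the dictionary layout and the helper name `gap_to_human` are assumptions introduced here, and only the three accuracy figures come from the findings above.

```python
# Accuracy figures (percent) as listed in the findings above.
accuracy = {
    "Human": 80.5,
    "GPT-4o": 47.1,
    "InternVL Chat V1.5": 29.2,
}


def gap_to_human(scores, baseline="Human"):
    """Return each model's shortfall from the baseline, in percentage points.

    Hypothetical helper for illustration; not from the CharXiv repository.
    """
    base = scores[baseline]
    return {
        name: round(base - acc, 1)
        for name, acc in scores.items()
        if name != baseline
    }


gaps = gap_to_human(accuracy)
for name, gap in gaps.items():
    print(f"{name}: {gap} points below human accuracy")
```

Under these figures, GPT-4o trails human accuracy by 33.4 percentage points and InternVL Chat V1.5 by 51.3.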

Abstract

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual…

Code & Models

Repositories

princeton-nlp/CharXiv (PyTorch, official)

Videos

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Lexicography and Language Studies · Mathematics, Computing, and Information Processing