ChartBench: A Benchmark for Complex Visual Reasoning in Charts
Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, Jian Guo

TL;DR
ChartBench is a new comprehensive benchmark with extensive data and an improved evaluation metric designed to assess multimodal models' complex visual reasoning in charts, revealing current limitations and guiding future improvements.
Contribution
The paper introduces ChartBench, a large-scale chart comprehension benchmark with novel evaluation metrics and baseline models, addressing limitations of existing benchmarks.
Findings
MLLMs show limited chart understanding capabilities.
Enhanced evaluation metric Acc+ effectively assesses model performance.
Baseline models improve reasoning on unannotated charts.
Abstract
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation. However, current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics. To address this, we propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning. ChartBench includes 42 categories, 66.6k charts, and 600k question-answer pairs. Notably, many charts lack data point annotations, which requires MLLMs to derive values similar to human understanding by leveraging inherent chart elements such as color, legends, and coordinate systems. We also design an enhanced evaluation metric, Acc+, to evaluate MLLMs without extensive manual or costly LLM-based evaluations. Furthermore, we propose two baselines based on the chain of thought…
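The abstract names Acc+ but does not define it on this page. The sketch below is one plausible reading, not the authors' implementation: it assumes each chart is queried with paired yes/no assertions (one true, one false) and that a chart counts as correct only when every paired query is answered correctly. The function names and the record layout `(chart_id, expected, predicted)` are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def _is_correct(expected, predicted):
    """Case- and whitespace-insensitive comparison of yes/no answers."""
    return expected.strip().lower() == predicted.strip().lower()


def plain_accuracy(records):
    """Fraction of individual yes/no queries answered correctly."""
    records = list(records)
    if not records:
        return 0.0
    return sum(_is_correct(e, p) for _, e, p in records) / len(records)


def acc_plus(records):
    """Assumed Acc+ variant: a chart scores only if ALL of its queries are correct."""
    per_chart = defaultdict(list)
    for chart_id, expected, predicted in records:
        per_chart[chart_id].append(_is_correct(expected, predicted))
    if not per_chart:
        return 0.0
    return sum(all(results) for results in per_chart.values()) / len(per_chart)


# Hypothetical model that answers "yes" to everything.
always_yes = [
    ("chart_1", "yes", "yes"),
    ("chart_1", "no", "yes"),
    ("chart_2", "yes", "yes"),
    ("chart_2", "no", "yes"),
]

print(plain_accuracy(always_yes))  # half of the individual queries are right
print(acc_plus(always_yes))        # no chart has both queries right
```

Under this assumption, a model that always answers "yes" gets half of the individual queries right but scores zero on the paired metric, which illustrates why a paired score depends less on chance than plain accuracy does.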
Peer Reviews
Decision · Submitted to ICLR 2025
1. The benchmark includes a wide array of chart types beyond conventional bar and line charts, enhancing its utility for complex visual reasoning evaluation. 2. Acc+ mitigates the limitations of earlier metrics by reducing model reliance on chance, providing a refined tool for measuring comprehension accuracy. 3. Baseline Models and Task Variety: The paper provides two baseline models based on chain-of-thought and supervised fine-tuning, which effectively highlight the limitations in current
1. The value extraction metric based on Acc+ does not make sense. Since the metadata contains the full CSV annotation of the chart data, an effective way to evaluate MLLMs would be to test comprehensive recognition of all elements in the chart. A binary classification for the value extraction task cannot fully demonstrate the visual understanding capability of MLLMs. 2. The template-based generation of instruction data somewhat lacks diversity. As illustrated in Figure 3(c), the distr
1. This paper is well-structured and presented, with a logical flow that makes it accessible to readers across different expertise levels. 2. The paper's major contribution lies in its comprehensive dataset construction. The inclusion of 9 major categories and 42 subcategories of charts represents an advancement in chart understanding. This extensive coverage substantially surpasses existing benchmarks and provides a more realistic evaluation framework for VLMs.
Beyond these strengths, several aspects of this paper warrant discussion. 1. While the work expands the range of chart types and introduces the Acc+ evaluation metric, the technical innovation appears somewhat limited beyond these contributions. This raises questions about the proposed benchmark's overall novelty. 2. A more fundamental concern relates to the dataset generation approach. The authors utilize standard chart plotting libraries, which generate relatively straightforward visual
1. By introducing a large-scale dataset with unannotated charts, this work advances visual understanding research in a way that aligns more closely with real-world applications. 2. The proposed CoT and SFT strategies are effective on these tasks. 3. The experiments are comprehensive.
1. There seems to be little innovation or discovery in this pipeline; the proposed baseline methods (Chain-of-Thought and supervised fine-tuning) are well-established techniques in the field. 2. How does the data generation pipeline differ from other chart-related datasets? 3. Have you considered testing the models' generalization ability on charts from completely different domains or visual styles not present in the training data? 4. Some recent strong baseline models do not appear to have been
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Weight Decay · Dropout · Attention Dropout · Layer Normalization · Multi-Head Attention
