TL;DR
Chart2Code is a hierarchical benchmark for evaluating how well large multimodal models understand charts and generate the code that renders them. Its tasks range from chart reproduction to complex chart editing and long-table-to-chart generation, and it comes with extensive evaluation metrics.
Contribution
To the authors' knowledge, this is the first hierarchical benchmark that systematically scales task difficulty for chart understanding and code generation in multimodal models.
Findings
The state-of-the-art GPT-5 achieves only 0.57 on code correctness.
Models struggle with complex chart editing tasks.
The benchmark contains 2,023 tasks across 22 chart types.
Abstract
We introduce Chart2Code, a new benchmark for evaluating the chart understanding and code generation capabilities of large multimodal models (LMMs). Chart2Code is explicitly designed from a user-driven perspective, capturing diverse real-world scenarios and progressively increasing task difficulty. It consists of three levels: Level 1 (Chart Reproduction) reproduces charts from a reference figure and user query; Level 2 (Chart Editing) involves complex modifications such as changing chart types or adding elements; and Level 3 (Long-Table to Chart Generation) requires models to transform long, information-dense tables into faithful charts following user instructions. To our knowledge, this is the first hierarchical benchmark that reflects practical chart2code usage while systematically scaling task complexity. In total, Chart2Code contains 2,023 tasks across 22 chart types, paired with…
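The three-level structure described in the abstract can be pictured as a small task record. The sketch below is illustrative only: `Level`, `ChartTask`, `missing_inputs`, and all field names are hypothetical and are not taken from the Chart2Code release. Which inputs each level requires is also an assumption inferred from the abstract's wording (a reference figure for Levels 1 and 2, a table for Level 3).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Level(Enum):
    """The three difficulty levels named in the abstract."""
    REPRODUCTION = 1  # Level 1: reproduce a chart from a reference figure and user query
    EDITING = 2       # Level 2: modify a chart (change its type, add elements, ...)
    LONG_TABLE = 3    # Level 3: turn a long, information-dense table into a chart


@dataclass
class ChartTask:
    """Hypothetical task record; field names are illustrative assumptions."""
    level: Level
    instruction: str                       # the user query or instruction
    reference_image: Optional[str] = None  # path to a reference figure, if any
    table_path: Optional[str] = None       # path to a source table, if any


def missing_inputs(task: ChartTask) -> List[str]:
    """Return the names of inputs this sketch assumes the task's level needs but lacks."""
    missing = []
    if not task.instruction.strip():
        missing.append("instruction")
    # Assumption: Levels 1 and 2 start from an existing chart image.
    if task.level in (Level.REPRODUCTION, Level.EDITING) and task.reference_image is None:
        missing.append("reference_image")
    # Assumption: Level 3 starts from a table rather than an image.
    if task.level is Level.LONG_TABLE and task.table_path is None:
        missing.append("table_path")
    return missing
```

For example, `missing_inputs(ChartTask(Level.LONG_TABLE, "Plot revenue by quarter"))` reports that `table_path` is absent, while a Level 1 task that has both an instruction and a reference image reports nothing missing.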
