ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation

Zhen Li; Duan Li; Yukai Guo; Xinyuan Guo; Bowen Li; Lanxi Xiao; Shenyu Qiao; Jiashu Chen; Zijian Wu; Hui Zhang; Xinhuan Shu; Shixia Liu

arXiv:2505.18668·cs.CV·October 17, 2025

TL;DR

ChartGalaxy is a large-scale dataset that enhances the understanding and generation of complex infographic charts by capturing their visual and structural diversity, aiding vision-language models in multimodal reasoning.

Contribution

The paper introduces ChartGalaxy, a million-scale dataset of real and synthetic infographic charts that supports improved understanding, benchmarking, and generation of infographic charts.

Findings

01 · Enhanced infographic chart understanding through fine-tuning.

02 · Benchmarking code generation for infographic charts.

03 · Facilitated example-based infographic chart generation.

Abstract

Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 440 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By…
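The template-driven synthesis described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the names `ChartSpec` and `synthesize_specs`, the identifier strings, and the toy data tables are all hypothetical, and the sketch assumes every chart variation can be paired with every layout template, which the abstract does not state. Only the counts (440 chart variations, 68 layout templates) come from the source.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartSpec:
    """Hypothetical description of one synthetic infographic chart."""
    chart_variation: str   # one of the 440 chart variations (ids are made up)
    layout_template: str   # one of the 68 layout templates (ids are made up)
    data_table: tuple      # toy tabular data as (label, value) pairs


def synthesize_specs(variations, layouts, tables, n, seed=0):
    """Sample n specs by pairing a variation, a layout, and a data table.

    Assumption: any variation may be combined with any layout. A real
    pipeline would likely filter out incompatible pairs before rendering.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [
        ChartSpec(rng.choice(variations), rng.choice(layouts), rng.choice(tables))
        for _ in range(n)
    ]


# Placeholder identifiers sized to match the counts in the abstract.
variations = [f"variation_{i:03d}" for i in range(440)]
layouts = [f"layout_{i:02d}" for i in range(68)]
tables = [
    (("A", 1), ("B", 2)),
    (("X", 10), ("Y", 20), ("Z", 30)),
]

specs = synthesize_specs(variations, layouts, tables, n=5, seed=42)

# Upper bound on distinct (variation, layout) pairs under the assumption above.
max_pairs = len(variations) * len(layouts)  # 440 * 68 = 29920
```

Each sampled `ChartSpec` would then be handed to a renderer, which is omitted here. Varying the data table across a fixed set of variation and layout pairs is one way such a design could reach a million-scale dataset from a few hundred templates.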

Peer Reviews

Decision · ICLR 2026 · Conditional Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

+ The dataset size is very large, and it seems to span many chart types and designs.
+ The dataset quality seems to be high.
+ The use cases are in general convincing.
+ The supplemental materials are a very helpful addition for presentation.

Weaknesses

- I am a bit lost in reading the curation of machine-generated synthetic infographics. I can see a lot of work went into this, but I feel like it is lacking a big picture. Please see my first question. The other parts of the paper all seem pretty easy to follow.
- Huge improvements on questions based on ChartGalaxy are probably not that surprising, given that the vast majority of infographics in the QA set are based on synthetic charts with fixed templates. Fine-tuning is expected to help a lot here.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Templates are carefully extracted from real data and utilized for synthetic data generation, resulting in a diverse set of samples.
2. A large number of evaluation experiments are conducted in a thorough and detailed manner.
3. This work provides a large-scale dataset in the infographic domain, where available data has been scarce until now.

Weaknesses

There are some unclear points regarding the details and procedures of the experiments.
1. In Section 3.3, the paper describes the extraction of layout templates. Could you clarify the format in which these templates are stored? While Figure 3 presents visual examples of the template images, it would be helpful to know whether they also contain information such as bounding boxes or other structural annotations.
2. In Section 3.4, which discusses Element Generation, could you please clarify th

Reviewer 03 · Rating 8 · Confidence 5

Strengths

S1: Comprehensive and large-scale dataset. The dataset includes over 1.76 million infographic charts paired with tabular data, exceeding the scale of previous datasets. This scope enables training LVLMs for realistic infographic scenarios, supporting broad generalization. The inclusion of 75 chart types, 440 chart variations, and 68 layout templates reflects high visual and structural diversity. The dual-source construction (real + synthetic) provides both authenticity and scalability, addressin

Weaknesses

W1: The manuscript would benefit from a clearer definition of what qualifies as an infographic chart and a more concrete explanation of how it differs from plain charts in terms of reasoning challenges. The current description in the Introduction could be made more specific, perhaps with brief examples or a sharper comparison to existing VQA or chart QA tasks to better highlight the unique difficulty of infographic chart understanding.
W2: Table 1 shows that models achieve higher accuracy on

Code & Models

Datasets

ChartGalaxy/ChartGalaxy
dataset · 3.3k dl

Taxonomy

Topics: Time Series Analysis and Forecasting · Advanced Database Systems and Queries