ChartMaster: Advancing Chart-to-Code Generation with Real-World Charts and Chart Similarity Reinforcement Learning
Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, Xiaodong He

TL;DR
ChartMaster introduces a large, diverse dataset of real-world charts and a reinforcement learning approach with a novel similarity reward to improve chart-to-code generation accuracy and visual fidelity.
Contribution
The paper presents ReChartPrompt, a large-scale real-world chart dataset, and ChartSimRL, a reinforcement learning method with a new similarity reward, advancing chart-to-code generation.
Findings
Achieves state-of-the-art results among 7B-parameter models.
Rivals GPT-4o on chart-to-code benchmarks.
Enhances visual fidelity and diversity in generated charts.
Abstract
The chart-to-code generation task requires MLLMs to convert chart images into executable code. This task faces two main challenges: limited data diversity and the difficulty of maintaining visual consistency between generated charts and the original ones. Existing datasets mainly rely on synthetic seed data to prompt GPT models for code generation, resulting in homogeneous samples that limit model generalization to real-world chart styles. To address this, we propose ReChartPrompt, leveraging real-world, human-designed charts extracted from arXiv papers as prompts. By harnessing the rich content and diverse visual styles of arXiv charts, we construct ReChartPrompt-240K, a large-scale and highly diverse dataset that better reflects realistic chart variations. For the second challenge, although SFT improves code understanding by optimizing next-token prediction, it does not provide direct…
Peer Reviews
Decision: Submitted to ICLR 2026
Originality: While the overall training paradigm (data distillation → SFT → GRPO) follows established LLM fine-tuning practices, the paper demonstrates an original and well-motivated application of this pipeline to the underexplored domain of chart-to-code generation. The proposed ChartSimRL introduces a novel dual-reward design that jointly leverages visual and attribute similarity signals—an inventive adaptation of multimodal reward shaping to code generation tasks. This represents a meaningfu
1. Insufficient analysis of the SFT–RL interplay. The paper does not clearly isolate the contribution of the SFT and GRPO stages in the final model’s performance. Specifically, it remains unclear whether the improvement of ChartMaster over the SFT baseline arises from the GRPO phase itself or from the preceding supervised fine-tuning on ReChartPrompt. The authors did not conduct an experiment where Qwen2.5-VL-7B is directly fine-tuned with GRPO without SFT, which would have provided stronger evi
* The ReChartPrompt pipeline is quite novel and helps increase the visual diversity of the resulting dataset. It proposes a “chart replotting” technique that conditions the generation of synthetic chart-code pairs on real-world charts.
* The authors designed two novel reward functions for the GRPO algorithm, with detailed ablations to justify and support their design choices (Tables 3, 4, and 5). The resulting model, ChartMaster, also achieves SOTA results on a variety of chart-to-code tasks.
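As a rough illustration of how GRPO turns per-sample rewards into a training signal, the sketch below normalizes the rewards of one group of sampled completions into group-relative advantages. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration; the paper's exact configuration is not given in this excerpt.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of completions sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    are scored relative to their siblings rather than by a learned value model.
    The population standard deviation and eps are illustrative assumptions.
    """
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three code samples generated from one chart image.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples that score above the group mean receive positive advantages and are reinforced; samples below it are pushed down. When every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient.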
* The chart images come from a single source, arXiv, which may limit the visual and topical diversity of the dataset.
* The authors claim that replotting real-world chart images increases dataset diversity compared to existing approaches that just prompt an LLM to generate chart-code pairs. While I believe this is likely true, there’s no analysis to support this claim.
* The paper is limited to chart-to-code, which is a very niche task, and doesn’t explore the
- The clear algorithmic core is the heart of the paper: the reward is explicitly defined and implementable (Jaccard on attributes + ResNet-18 cosine on multi-level features under a GRPO objective). This is a reasonable and task-aligned signal for chart reproduction.
- Competitive 7B results are reported. The authors claim the best open-source 7B performance across several chart-to-code benchmarks; the table lists both closed and open models for context.
- Leakage awareness is studied to an extent. Authors stat
- My one concern is that the data construction lacks quantitative auditing. While the paper says it crawls real-world tables/charts and filters non-executable code, it doesn’t quantify data quality: e.g., the percentage of images incorrectly chart-typed, the code-render success rate after filtering, or inter-annotator checks. Without a data card-style audit, it’s hard to judge noise, coverage, and bias in ReChartPrompt.
- [minor] Does the reward overfit to style surrogates? Their visual reward uses ResNet-18 featu
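The reward this review describes (Jaccard similarity on chart attributes plus cosine similarity on multi-level image features) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature vectors are plain lists standing in for ResNet-18 activations, the attribute sets are hypothetical strings, and combining the two terms by an unweighted sum is an assumption, since the excerpt does not state how they are combined.

```python
import math
from typing import Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def visual_similarity(
    feats_gen: Sequence[Sequence[float]], feats_ref: Sequence[Sequence[float]]
) -> float:
    """Average cosine similarity over feature levels (one vector per level).

    In the described reward these vectors would come from several ResNet-18
    layers; here they are supplied by the caller as plain lists.
    """
    sims = [cosine(u, v) for u, v in zip(feats_gen, feats_ref)]
    return sum(sims) / len(sims) if sims else 0.0


def chart_similarity_reward(
    attrs_gen: Set[str],
    attrs_ref: Set[str],
    feats_gen: Sequence[Sequence[float]],
    feats_ref: Sequence[Sequence[float]],
) -> float:
    """Attribute term plus visual term; the unweighted sum is an assumption."""
    return jaccard(attrs_gen, attrs_ref) + visual_similarity(feats_gen, feats_ref)


# Hypothetical attributes extracted from generated and reference plotting code.
reward = chart_similarity_reward(
    {"type:bar", "legend", "title"},
    {"type:bar", "legend", "grid"},
    [[1.0, 0.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.5, 0.5]],
)
```

Under this sketch, a generated chart that matches the reference in both attributes and features scores close to 2.0. The style-surrogate question raised in the review corresponds to the visual term: it depends entirely on what the chosen feature extractor treats as similar.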
Taxonomy
Topics: Time Series Analysis and Forecasting
