Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma

TL;DR
This paper introduces Multimodal Structured Reinforcement Learning (MSRL), a novel approach that combines textual and visual feedback to significantly improve chart-to-code generation, surpassing the limitations of supervised fine-tuning.
Contribution
It proposes MSRL with a multi-granularity reward system and a two-stage curriculum training strategy, achieving state-of-the-art results on large-scale real-world datasets.
Findings
MSRL improves performance by over 6% on ChartMimic.
MSRL outperforms existing methods on the ChartMimic and ReachQA chart-to-code benchmarks.
A large-scale dataset of 3 million chart-code pairs enhances training.
Abstract
While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring deep understanding of information-rich images and structured output generation remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to produce structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies tailored to structured outputs. In this paper, we systematically investigate the performance plateau of SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation. We construct the largest training corpus to date, with 3 million chart-code pairs curated from real-world tables in arXiv papers, addressing the limitations of previous synthetic…
Peer Reviews
Decision: ICLR 2026 Poster
* The proposed RL approach shows performance improvements on the chart-to-code task that cannot be achieved by scaling the SFT data or training alone. These experiments are quite interesting and could offer valuable insights.
* The proposed model achieves state-of-the-art results on two chart-to-code benchmarks: ChartMimic and ReachQA. The authors provide detailed ablation experiments showing the benefit of each proposed technique and reward function.
* Limited visual diversity: using a set of example codes and forcing the code to follow a specific structure may significantly limit the visual diversity of the dataset. There is also no analysis to support the claim of visual diversity compared to existing datasets and approaches.
* The evaluation is limited to two benchmarks, ChartMimic and ReachQA. Furthermore, it is limited to the niche chart-to-code task. It would strengthen the contribution of the paper if the RL approach can be expanded
1. Clear empirical story about the SFT ceiling. The authors isolate the SFT scaling curve and convincingly show a plateau beyond 2M samples before introducing RL. This somewhat strengthens the causal claim that RL brings the next jump, rather than the gain coming from under-tuned SFT.
2. The textual reward normalizes code and scores specific fields (data, type, layout, titles/labels, exec), while the visual reward compares rendered images against ground truth via an MLLM. The two-stage schedule is sensible in my opinion and empirically vali
- The visual reward clearly depends on Qwen2.5-VL as the judge. If the policy aligns to the judge's biases or defects, improvements could reflect evaluator gaming rather than genuine fidelity. A judge-swap test (e.g., a different MLLM or human raters) may be needed to rule out judge overfitting.
- The entire pipeline centers on Matplotlib-style code and a fixed rendering toolchain. It is unclear whether the learned behaviors transfer to other plotting libraries (Seaborn, Plotly, or Vega) or different runtimes
* Clear diagnosis of the SFT plateau. The scaling curve explicitly plateaus after ~2M SFT examples, and the paper shows that visual RL breaks the curve.
* The large, curated, and balanced chart-code corpus from real-world arXiv papers is a great contribution to the community.
* MSRL (7B) outperforms ChartCoder and is competitive with GPT-4o on component scores such as layout and text (Tables 1–2, p. 6), with qualitative cases showing better fidelity and execution reliability than open and proprietary models
* The visual reward (and RL data filtering) rely on a single MLLM judge (Qwen2.5-VL-72B). Without cross-judge verification or human studies, there is a real risk of reward hacking or bias toward that evaluator's preferences.
* Possible double-counting of execution success in stage 2: in Sec 3.3, R = w_t*R_text + w_v*R_vis + w_e*R_exec, and R_text also contains R_exec.
* Compute-normalized ablation for two-stage vs single-stage: It looks possible that the two-stage curriculum outperforms single-stage p
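The double-counting concern about R = w_t*R_text + w_v*R_vis + w_e*R_exec can be made concrete with a small sketch. This is not the authors' implementation: the `RewardParts` container, the way the execution term is mixed into the textual reward (`exec_share_in_text`), and all weight values are hypothetical, chosen only to show how an execution term inside R_text adds to the explicit w_e term.

```python
from dataclasses import dataclass


@dataclass
class RewardParts:
    """Per-sample scores; all fields are illustrative assumptions."""
    text_fields: float  # mean field-level score in [0, 1] (data, type, layout, titles/labels)
    visual: float       # judge-assigned image similarity in [0, 1]
    executed: bool      # whether the generated code ran without error


def text_reward(parts: RewardParts, exec_share_in_text: float) -> float:
    # Assumption: R_text is a convex mix of field scores and execution success.
    r_exec = 1.0 if parts.executed else 0.0
    return (1.0 - exec_share_in_text) * parts.text_fields + exec_share_in_text * r_exec


def stage2_reward(parts: RewardParts, w_t: float, w_v: float, w_e: float,
                  exec_share_in_text: float) -> float:
    # R = w_t * R_text + w_v * R_vis + w_e * R_exec, as quoted in the review.
    r_exec = 1.0 if parts.executed else 0.0
    return (w_t * text_reward(parts, exec_share_in_text)
            + w_v * parts.visual
            + w_e * r_exec)


def effective_exec_weight(w_t: float, w_e: float, exec_share_in_text: float) -> float:
    # Execution success is rewarded once inside R_text and once explicitly.
    return w_t * exec_share_in_text + w_e


if __name__ == "__main__":
    ok = RewardParts(text_fields=0.8, visual=0.5, executed=True)
    failed = RewardParts(text_fields=0.8, visual=0.5, executed=False)
    gap = stage2_reward(ok, 0.4, 0.4, 0.2, 0.25) - stage2_reward(failed, 0.4, 0.4, 0.2, 0.25)
    print(f"reward gap from execution alone: {gap:.2f}")
    print(f"effective execution weight:      {effective_exec_weight(0.4, 0.2, 0.25):.2f}")
```

Under these assumed weights, the reward gap between an executing and a non-executing sample equals `w_t * exec_share_in_text + w_e`, which is larger than the nominal `w_e`. Whether this matters in practice depends on the actual definition of R_text in the paper, which this summary does not spell out.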
Taxonomy
Topics: Advanced Measurement and Metrology Techniques · Manufacturing Process and Optimization · Model-Driven Software Engineering Techniques
