Effective Training Data Synthesis for Improving MLLM Chart Understanding

Yuwei Yang; Zeyu Zhang; Yunzhong Hou; Zhuowan Li; Gaowen Liu; Ali Payani; Yuan-Sen Ting; Liang Zheng

arXiv:2508.06492·cs.CV·August 11, 2025

TL;DR

This paper introduces a modular data synthesis pipeline that creates a diverse and high-quality chart dataset, significantly enhancing the ability of multimodal large language models to understand scientific plots.

Contribution

The study presents a novel five-step data synthesis pipeline for generating diverse, high-quality chart datasets to improve MLLM chart understanding capabilities.

Findings

01. ECD dataset improves MLLM performance on real-world charts

02. Diversified visual details enhance model understanding

03. Synthetic data boosts accuracy on complex charts

Abstract

Being able to effectively read scientific plots, or chart understanding, is central to building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, still fall behind, with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often limited by the charts' inadequate similarity to real ones, which can compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improve chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually…
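The two pipeline ideas named in the abstract, separating data creation from plotting-function creation and conditioning later subplots on earlier ones, can be illustrated with a minimal sketch. This is not the authors' pipeline: the `PlotSpec` class, the helper names, the chart types, the themes, and the specific conditioning rule (shared theme, shared number of points, preference for unused chart types) are all assumptions made up for illustration, and the sketch builds plain plot descriptions rather than rendering images.

```python
import random
from dataclasses import dataclass


@dataclass
class PlotSpec:
    """Hypothetical description of one subplot (not a structure from the paper)."""
    chart_type: str
    theme: str
    x: list
    y: list


def make_data(rng, n_points):
    """Data creation: produce values without deciding how they are drawn."""
    x = list(range(n_points))
    y = [round(rng.uniform(0, 100), 1) for _ in x]
    return x, y


def make_plot_spec(chart_type, theme, data):
    """Function creation: decide how already-created data is presented."""
    x, y = data
    return PlotSpec(chart_type=chart_type, theme=theme, x=x, y=y)


def make_figure(n_subplots, seed=0):
    """Build subplot specs one at a time, each conditioned on the earlier ones."""
    rng = random.Random(seed)
    chart_types = ["line", "bar", "scatter"]  # assumed chart types
    theme = rng.choice(["light", "dark"])     # assumed themes, shared by all subplots
    specs = []
    for _ in range(n_subplots):
        # Assumed conditioning rule: prefer chart types not used so far,
        # and reuse the first subplot's number of points.
        used = {s.chart_type for s in specs}
        candidates = [c for c in chart_types if c not in used] or chart_types
        n_points = len(specs[0].x) if specs else 5
        data = make_data(rng, n_points)
        specs.append(make_plot_spec(rng.choice(candidates), theme, data))
    return specs


if __name__ == "__main__":
    for spec in make_figure(3, seed=1):
        print(spec.chart_type, spec.theme, spec.y)
```

Keeping `make_data` independent of `make_plot_spec` means the same values could be paired with different presentations, which is one plausible way to vary visual details without regenerating data.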

Code & Models

Models

ChartFoundation/ECD_Finetuned_MLLMs (Hugging Face model)

Taxonomy

Topics: Advanced Computational Techniques and Applications