PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization
Jiajun Zhang, Jianke Zhang, Zeyu Cui, Jiaxi Yang, Lei Zhang, Binyuan Hui, Qiang Liu, Zilei Wang, Liang Wang, Junyang Lin

TL;DR
PlotCraft introduces a comprehensive benchmark and dataset for evaluating and improving large language models' ability to generate complex, interactive data visualizations across diverse domains, addressing current performance gaps.
Contribution
The paper presents PlotCraft, a new benchmark and dataset for complex visualization tasks, and introduces PlotCraftor, a small yet effective code generation model that significantly improves visualization capabilities.
Findings
Existing LLMs perform poorly on complex visualization tasks.
PlotCraftor achieves over 50% performance improvement on hard tasks.
The benchmark and dataset enable systematic evaluation and development of visualization models.
Abstract
Recent Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation. However, their ability to create complex visualizations for scaled and structured data remains largely unevaluated and underdeveloped. To address this gap, we introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks that cover a wide range of topics, such as finance, scientific research, and sociology. The benchmark is structured around seven high-level visualization tasks and encompasses 48 distinct chart types. Crucially, it is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities. Our comprehensive evaluation of 23 leading LLMs on PlotCraft reveals obvious performance deficiencies in handling sophisticated visualization tasks. To bridge this performance gap, we develope…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Relevant topic: The paper addresses an emerging area in LLM-based complex visualization generation and outlines a research gap that is worth exploring. It also proposes a benchmark to facilitate further study in this space.
- Use of real-world data: The SYNTHVIS-30K dataset is reasonably large and built from real data sources, which helps enhance its practical value.
- Evaluation design: The paper presents experiments across 24 models and provides an analysis of their behavior in visualizati
- **Undefined notion of “multi-chart” generation:** The paper does not clearly define what constitutes *multi-chart* or *multi-plot* generation. If it merely refers to placing several plots side-by-side, the novelty and significance are limited. A more meaningful definition should clarify whether the goal involves cross-chart coherence, shared data semantics, or insight-driven multi-view coordination. It is also unclear whether the dataset considers the relationship between generated charts and
- A new dataset that captures a much richer set of visualizations than previous benchmarks, making it better suited to measuring a model's data visualization performance.
- The synthetic benchmark provides a source for fine-tuning models toward this goal.
- The tasks in this dataset are quite detailed and verbose, which may not match how such models would be used in practice. I don't think this is a big issue, since the paper focuses on measuring model capability, but it is worth emphasizing, because many practical visualization tasks are far more open-ended and start from a high-level question.
- From the examples in the appendix, some charts with layout issues could be the result of the data and instruction characteristics (i.e., if the model f
- **Timely Benchmark** (PlotCraft): The paper introduces a much-needed benchmark that moves beyond simple text-to-chart tasks (e.g. VisEval, NVbench). I think this benchmark is a timely step towards better alignment of LLMs' coding capabilities with people's practical needs for visualization generation.
- **Strong Empirical Results and Model** (PlotCraftor): The comprehensive evaluation across 24 models provides clear evidence of the current state-of-the-art and its limitations. The proposed PlotCr
- **Reliance on Automated Evaluation**: While the authors validate an MLLM-based judge (Gemini-2.5-Pro) against human scores, current MLLMs can still miss important visual flaws and quality aspects that are easy for humans to spot. Moreover, relying on a specific version of a proprietary MLLM (Gemini-2.5-Pro) for the benchmark could introduce bias and stability risks. These limitations should be discussed more prominently. Some relevant works that are recommended for review and discussion: [1] VisJud
Taxonomy
Topics: Data Visualization and Analytics · Scientific Computing and Data Management · Machine Learning in Materials Science
