From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Jianwen Sun, Fanrui Zhang, Yukang Feng, Chuanhao Li, Zizhen Li, Jiaxin Ai, Yifan Chang, Yu Dai, Kaipeng Zhang

TL;DR
This paper introduces VisPainter, a multi-agent framework for creating editable, high-density scientific illustrations with element-level control, and proposes VisBench, a comprehensive benchmark for evaluating such illustrations.
Contribution
The paper presents a novel multi-agent system for scientific illustration that enables editable, component-wise control and introduces a new benchmark for evaluation.
Findings
VisPainter enables true element-level editing of scientific diagrams.
VisBench provides a multi-dimensional assessment of illustration quality.
Vision-language models are evaluated, revealing their strengths and limitations.
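To make "element-level editing" concrete, here is a minimal sketch of a diagram whose components stay individually addressable, in contrast to a flat raster image. It is illustrative only and is not VisPainter's API: `Shape`, `move_shape`, and the `enc`/`dec` identifiers are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Shape:
    """One addressable diagram element (hypothetical structure)."""
    shape_id: str
    label: str
    x: float
    y: float


def move_shape(diagram, shape_id, dx, dy):
    """Return a new diagram in which only the named shape is shifted."""
    if shape_id not in diagram:
        raise KeyError(shape_id)
    shape = diagram[shape_id]
    updated = dict(diagram)  # shallow copy; every other shape is reused as-is
    updated[shape_id] = replace(shape, x=shape.x + dx, y=shape.y + dy)
    return updated


# A two-box diagram keyed by element id.
diagram = {
    "enc": Shape("enc", "Encoder", 0.0, 0.0),
    "dec": Shape("dec", "Decoder", 4.0, 0.0),
}

# Move only the decoder box; the encoder box is untouched.
edited = move_shape(diagram, "dec", 0.0, 2.0)
```

Because each element keeps its own identifier and geometry, a single component can be moved or relabeled without regenerating the whole figure, which a rasterized output does not allow.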
Abstract
Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations: First, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of "writing-compiling-reviewing" and lack the intuitiveness of direct manipulation. Neither approach adequately meets the need for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules: a Manager, a Designer, and a…
Peer Reviews
Decision·Submitted to ICLR 2026
- An agentic system for illustration generation with tool calling.
- A new benchmark for scientific illustration generation, with metrics.
- Provides controllability and editability for the output design.
- An agent system built on LLM prompts, without much innovation.
- The generated results are very simple flow charts, yet the results are not great.
- The benchmark size is limited, with only a few hundred annotated examples.
The paper addresses a genuine gap in scientific diagram generation by bridging the divide between raster-based generative models that lack editability and code-based approaches that impose cumbersome write-compile-review cycles. The originality lies in the GUI-level interaction paradigm, which enables direct manipulation of vector elements while maintaining full editability—a practical advantage over both existing approaches. The multi-agent architecture with explicit role separation is well-mot
The computational efficiency represents a critical practical limitation, with complex diagrams requiring several tens of minutes to generate—orders of magnitude slower than single-pass diffusion models. This overhead severely restricts the framework's utility for rapid iteration or large-scale deployment. The tight coupling to Microsoft Visio fundamentally limits the work's impact and generalizability. While the authors justify this choice through development cost arguments, the 60:81 usage rati
- The problem setting of scientific illustration drawing is novel. The extensibility of this problem is not addressed well in the paper, so the problem appears to be quite narrow in scope. However, I believe this problem can be polished further into benchmarks for the illustrative reasoning of language models.
- The introduction of MCP tools and their benchmarks at a major conference like this is quite a new approach, and I would like to mark this as a strength rather than a weakness, despite its “hig
Despite the novelty of this work, I find several unresolved issues within the presentation.
1. Although the authors have devoted multiple pages to explaining the philosophy behind their choice of quantitative scoring system, the complexity of the evaluation metrics is not fully justified. The central problem is that the authors have presented both the MCP framework and their evaluation metrics at the same time. They require in-depth justification to ensure that the performance measures are not bi
Taxonomy
Topics: Data Visualization and Analytics · Multimodal Machine Learning Applications · Computer Graphics and Visualization Techniques
