From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
Haonian Ji, Shi Qiu, Siyang Xin, Siwei Han, Zhaorun Chen, Dake Zhang, Hongyi Wang, Huaxiu Yao

TL;DR
This paper introduces EduVisBench, a comprehensive benchmark for evaluating visual reasoning in educational models, and EduVisAgent, a multi-agent framework that significantly improves pedagogical visualizations by coordinating specialized agents.
Contribution
The paper presents EduVisBench for assessing visual reasoning in educational models and proposes EduVisAgent, a multi-agent system that enhances visualization quality and reasoning capabilities.
Findings
Existing models struggle with visual reasoning and decomposition.
EduVisAgent outperforms baselines with a 40.2% improvement.
EduVisAgent produces more educationally aligned visualizations.
Abstract
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper tackles a timely and important problem. As AI becomes more integrated into education, the ability to generate not just correct answers but effective teaching materials is paramount. The focus on pedagogical visualization as a distinct capability gap in FMs is novel and well-motivated.
2. The design of EduVisAgent is not an ad-hoc collection of agents but is thoughtfully grounded in pedagogical theory, mimicking the division of labor in instructional design. The performance improvem

1. The EduVisAgent framework consists of five distinct agents. While the overall system is highly effective, the paper lacks an ablation study to analyze the individual contribution of each agent. For example, how critical is the Metacognitive Reviewer or the Conceptual Mapping Agent to the final score? Understanding the impact of each component would provide deeper insight into the architecture and help identify the most critical elements for pedagogical visualization.
2. The benchmark and age

(1) The formulation of a multi-agent system specifically tailored for pedagogical visualization seems novel.
(2) The paper is well-executed, with a rigorous experimental setup involving multiple model families.
(3) The writing is clear and well-structured.

(1) While the use of GPT-4o as an automated judge is validated, it remains a single-model evaluator. Including more diverse evaluators (e.g., human teachers, multiple LVLMs) could strengthen the reliability of the scoring system.
(2) The paper does not include an ablation study to analyze the contribution of each agent in EduVisAgent. Understanding which components are most critical would help future researchers prioritize agent design.
(3) The multi-agent system is computationally intensive.

- Developed EduVisBench, a benchmark with richer information compared to datasets from existing generative models.
- Successfully validated the reliability of the LLM-based automatic evaluation system using human assessments shown in Table 2.
- Used five specialized agents to implement its strategies.
- Tested broad generalization capabilities across three major academic domains: mathematics, physics, and chemistry.

- A detailed explanation of the dataset utilized for evaluation is required, as the content presented in Figure 3 is unclear.
- The input prompts used for the LLM evaluation in Table 1 were not disclosed, making the verification of fairness difficult.
- The description of each agent in EduVisAgent is too brief, leaving the method of theory implementation unclear.
- The description of how the theory was implemented in the system is lacking. Openness regarding the benchmark and the educational da
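Several of the comments above turn on how far an automated judge can be trusted relative to human raters. As a rough illustration only, and not the paper's actual validation procedure, agreement between two sets of rubric scores could be checked with a Pearson correlation and a mean absolute difference. The function names and the score values below are hypothetical and were made up for this sketch; they are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_abs_diff(xs, ys):
    """Average absolute gap between paired scores."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical rubric scores for the same five outputs (illustrative only).
judge_scores = [62.0, 48.0, 75.0, 55.0, 81.0]
human_scores = [60.0, 50.0, 70.0, 58.0, 85.0]

print("Pearson r:", round(pearson(judge_scores, human_scores), 3))
print("Mean absolute difference:", mean_abs_diff(judge_scores, human_scores))
```

A high correlation with a small absolute gap would suggest the judge ranks and scales outputs much as the human raters do. Repeating the check with additional judges or raters is one way to address the single-evaluator concern.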
Taxonomy
Topics: Semantic Web and Ontologies · Model-Driven Software Engineering Techniques · Intelligent Tutoring Systems and Adaptive Learning
Methods: Focus · Diffusion
