From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization

Haonian Ji; Shi Qiu; Siyang Xin; Siwei Han; Zhaorun Chen; Dake Zhang; Hongyi Wang; Huaxiu Yao

arXiv:2505.16832 · cs.AI · May 29, 2025

Open Access · 2 Repos · 3 Reviews

TL;DR

This paper introduces EduVisBench, a multi-domain, multi-level benchmark for evaluating how well foundation models produce visually grounded explanations of STEM problems, and EduVisAgent, a multi-agent framework that coordinates specialized agents and outperforms baselines by 40.2%.

Contribution

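The paper presents EduVisBench, a benchmark with a fine-grained, pedagogy-informed rubric for assessing the visual reasoning of foundation models in educational settings, and proposes EduVisAgent, a multi-agent system of five specialized agents that improves visualization quality and reasoning capabilities.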

Findings

01. Existing models struggle with visual reasoning and decomposition.

02. EduVisAgent outperforms baselines with a 40.2% improvement.

03. EduVisAgent produces more educationally aligned visualizations.

Abstract

While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper tackles a timely and important problem. As AI becomes more integrated into education, the ability to generate not just correct answers but effective teaching materials is paramount. The focus on pedagogical visualization as a distinct capability gap in FMs is novel and well-motivated.
2. The design of EduVisAgent is not an ad-hoc collection of agents but is thoughtfully grounded in pedagogical theory, mimicking the division of labor in instructional design. The performance improvem

Weaknesses

1. The EduVisAgent framework consists of five distinct agents. While the overall system is highly effective, the paper lacks an ablation study to analyze the individual contribution of each agent. For example, how critical is the Metacognitive Reviewer or the Conceptual Mapping Agent to the final score? Understanding the impact of each component would provide deeper insight into the architecture and help identify the most critical elements for pedagogical visualization.
2. The benchmark and age

Reviewer 02 · Rating 6 · Confidence 3

Strengths

(1) The formulation of a multi-agent system specifically tailored for pedagogical visualization seems novel.
(2) The paper is well-executed, with a rigorous experimental setup involving multiple model families.
(3) The writing is clear and well-structured.

Weaknesses

(1) While the use of GPT-4o as an automated judge is validated, it remains a single-model evaluator. Including more diverse evaluators (e.g., human teachers, multiple LVLMs) could strengthen the reliability of the scoring system.
(2) The paper does not include an ablation study to analyze the contribution of each agent in EduVisAgent. Understanding which components are most critical would help future researchers prioritize agent design.
(3) The multi-agent system is computationally intensive.

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- Developed EduVisBench, a benchmark with richer information compared to datasets from existing generative models.
- Successfully validated the reliability of the LLM-based automatic evaluation system using the human assessments shown in Table 2.
- Utilized five specialized agents to implement strategies.
- Tested broad generalization capabilities across three major academic domains: mathematics, physics, and chemistry.

Weaknesses

- A detailed explanation of the dataset utilized for evaluation is required, as the content presented in Figure 3 is unclear.
- The input prompts used for the LLM evaluation in Table 1 were not disclosed, making the verification of fairness difficult.
- The description of each EduVisAgent is too simple, leaving the method of theory implementation unclear.
- The description of how the theory was implemented in the system is lacking. Openness regarding the benchmark and the educational da

Taxonomy

Topics: Semantic Web and Ontologies · Model-Driven Software Engineering Techniques · Intelligent Tutoring Systems and Adaptive Learning

Methods: Focus · Diffusion