EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content
Shuzhen Bi, Mingzi Zhang, Zhuoxuan Li, Xiaolong Wang, Keqian Li, and Aimin Zhou

TL;DR
EduIllustrate introduces a comprehensive benchmark for evaluating large language models on generating multimodal educational content, combining text and diagrams for K-12 STEM problems.
Contribution
The paper presents a new benchmark with standardized protocols and evaluation metrics for assessing LLMs' ability to generate coherent, diagram-rich explanations in education.
Findings
Gemini 3.0 Pro achieves 87.8% accuracy on the benchmark.
Sequential anchoring improves visual consistency by 13% and reduces costs.
Human raters validate the reliability of LLMs as objective judges for content quality.
Abstract
Large language models are increasingly used as educational assistants, yet evaluation of their educational capabilities remains concentrated on question-answering and tutoring tasks. A critical gap exists for multimedia instructional content generation -- the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. We present EduIllustrate, a benchmark for evaluating LLMs on interleaved text-diagram explanation generation for K-12 STEM problems. The benchmark comprises 230 problems spanning five subjects and three grade levels, a standardized generation protocol with sequential anchoring to enforce cross-diagram visual consistency, and an 8-dimension evaluation rubric grounded in multimedia learning theory covering both text and visual quality. Evaluation of ten LLMs reveals a wide performance spread: Gemini 3.0 Pro…
