Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Lingyong Yan; Jiulong Wu; Dong Xie; Weixian Shi; Deguo Xia; Jizhou Huang

arXiv:2602.11790·cs.AI·February 13, 2026

Open Access

TL;DR

LAVES is a hierarchical multi-agent system that uses large language models to generate pedagogically coherent educational videos automatically, with high procedural fidelity and lower production cost.

Contribution

The paper introduces LAVES, a novel multi-agent framework that decomposes educational video generation into specialized agents coordinated by an orchestrator, improving fidelity and controllability over prior end-to-end models.

Findings

01. Generates over one million videos per day in large-scale deployment.

02. Reduces production costs by over 95% compared to industry standards.

03. Maintains high acceptance rates for generated educational videos.

Abstract

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LAVES, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LAVES formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio-visual alignment. To overcome the limitations of prior approaches (low procedural fidelity, high production cost, and limited controllability), LAVES decomposes the generation workflow into specialized agents…
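The decomposition described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the four agents (`solver`, `narrator`, `visualizer`, `aligner`) simply mirror the four objectives listed in the abstract, the `Orchestrator` and `VideoDraft` classes are hypothetical, and each agent returns a placeholder string where a real system would presumably call an LLM or a rendering tool. The sketch is also a flat sequential pipeline, not the hierarchical coordination the paper describes, and it is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class VideoDraft:
    """Shared state that every agent reads from and writes to (hypothetical)."""
    problem: str
    artifacts: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


# An agent is modeled as a plain function from the shared draft to its output.
Agent = Callable[[VideoDraft], str]


def solver(draft: VideoDraft) -> str:
    # Placeholder for step-by-step reasoning; a real agent would call an LLM.
    return f"steps for: {draft.problem}"


def narrator(draft: VideoDraft) -> str:
    # Placeholder for narration derived from the solution steps.
    return f"narration of [{draft.artifacts['solution']}]"


def visualizer(draft: VideoDraft) -> str:
    # Placeholder for a visual demonstration script derived from the solution.
    return f"visual script for [{draft.artifacts['solution']}]"


def aligner(draft: VideoDraft) -> str:
    # Placeholder for audio-visual alignment of narration and visuals.
    narration = draft.artifacts["narration"]
    visuals = draft.artifacts["visuals"]
    return f"timeline syncing [{narration}] with [{visuals}]"


class Orchestrator:
    """Runs the agents in order and stores each output under its stage name."""

    def __init__(self, stages: List[Tuple[str, Agent]]) -> None:
        self.stages = stages

    def run(self, problem: str) -> VideoDraft:
        draft = VideoDraft(problem=problem)
        for name, agent in self.stages:
            draft.artifacts[name] = agent(draft)
            draft.log.append(name)
        return draft


# Stage order is an assumption: later agents depend on earlier artifacts.
PIPELINE: List[Tuple[str, Agent]] = [
    ("solution", solver),
    ("narration", narrator),
    ("visuals", visualizer),
    ("alignment", aligner),
]

if __name__ == "__main__":
    result = Orchestrator(PIPELINE).run("Solve 2x + 3 = 7")
    for stage in result.log:
        print(stage, "->", result.artifacts[stage])
```

Keeping each agent behind the same function signature means a stage can be swapped or re-run in isolation, which is one plausible reading of the controllability benefit the abstract attributes to decomposition.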

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Artificial Intelligence in Games