PresentAgent: Multimodal Agent for Presentation Video Generation

Jingwei Shi; Zeyu Zhang; Biao Wu; Yanjie Liang; Meng Fang; Ling Chen; Yang Zhao

arXiv:2507.04036 · cs.CV · July 8, 2025

TL;DR

PresentAgent is a multimodal system that converts long documents into presentation videos with synchronized narration and visuals. The authors evaluate it with a new assessment framework, PresentEval, and report quality approaching human level.

Contribution

We introduce PresentAgent, a novel multimodal pipeline for automatic presentation video generation from documents, together with PresentEval, a new evaluation framework.

Findings

01 Approaches human-level quality in presentation video generation

02 Effective synchronization of visuals and narration achieved

03 PresentEval provides comprehensive multimodal video assessment

Abstract

We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical…
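The pipeline stages named in the abstract (segment, plan slides, narrate, compose with audio-visual alignment) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: every name here (`Slide`, `segment`, `plan_slide`, `estimate_duration`, `build_timeline`) is hypothetical, splitting on blank lines stands in for document segmentation, the section body stands in for LLM-written narration, and a fixed speaking rate stands in for the length of real Text-to-Speech audio.

```python
from dataclasses import dataclass

# Assumed speaking rate; a real system would measure the synthesized audio instead.
WORDS_PER_SECOND = 2.5


@dataclass
class Slide:
    title: str
    narration: str
    start: float = 0.0  # second at which the slide appears in the video
    end: float = 0.0    # second at which its narration finishes


def segment(document: str) -> list[str]:
    """Split the document into sections on blank lines (simplistic stand-in)."""
    return [part.strip() for part in document.split("\n\n") if part.strip()]


def plan_slide(section: str) -> Slide:
    """Treat the first line as the slide title and the rest as narration text."""
    title, _, body = section.partition("\n")
    title = title.strip()
    # Placeholder for LLM narration: reuse the body, or the title if the body is empty.
    narration = " ".join(body.split()) or title
    return Slide(title=title, narration=narration)


def estimate_duration(narration: str) -> float:
    """Estimate narration length in seconds from its word count."""
    return len(narration.split()) / WORDS_PER_SECOND


def build_timeline(document: str) -> list[Slide]:
    """Give each slide a time window equal to its narration length."""
    slides = [plan_slide(section) for section in segment(document)]
    clock = 0.0
    for slide in slides:
        slide.start = clock
        clock += estimate_duration(slide.narration)
        slide.end = clock
    return slides


if __name__ == "__main__":
    doc = (
        "Intro\none two three four five\n\n"
        "Method\none two three four five six seven eight nine ten"
    )
    for s in build_timeline(doc):
        print(f"{s.start:.1f}-{s.end:.1f}s  {s.title}")
```

The point of the sketch is the alignment rule: each slide stays on screen for exactly as long as its narration lasts, so visuals and audio cannot drift apart when the clips are concatenated.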

Code & Models

Repositories

AIGeeksGroup/PresentAgent
PyTorch · Official

Datasets

AIGeeksGroup/Doc2Present
dataset · 86 downloads

Videos

PresentAgent: Multimodal Agent for Presentation Video Generation · underline