PresentAgent-2: Towards Generalist Multimodal Presentation Agents
Wei Wu, Ziyang Xu, Zeyu Zhang, Yang Zhao, Hao Tang

TL;DR
PresentAgent-2 is a framework that generates multimodal presentation videos from user queries. It supports three presentation modes (single presentation, discussion, and interaction) and comes with a new benchmark for evaluation.
Contribution
It introduces a unified framework for query-driven, multimodal presentation video generation that supports three presentation modes, and it provides a new benchmark for evaluation.
Findings
Supports three presentation modes: single, discussion, interaction.
Generates multimodal content including text, images, GIFs, videos.
Establishes a multimodal presentation benchmark with diverse evaluation criteria.
Abstract
Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion,…
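The stages named in the abstract (summarize the query into a topic, research multimodal resources, build slides, write a mode-specific script, compose the result) can be sketched as a simple pipeline. This is a minimal illustration, not the authors' implementation: every class and function name below is hypothetical, each stage is a stub standing in for what would presumably be a model or tool call, and no audio or video is actually rendered.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    # The three presentation modes named in the source.
    SINGLE = "single"
    DISCUSSION = "discussion"
    INTERACTION = "interaction"


@dataclass
class Resources:
    # The four media types the abstract lists; contents here are placeholders.
    text: list = field(default_factory=list)
    images: list = field(default_factory=list)
    gifs: list = field(default_factory=list)
    videos: list = field(default_factory=list)


@dataclass
class Presentation:
    topic: str
    mode: Mode
    slides: list
    script: list


def summarize_query(query: str) -> str:
    # Stub (assumption): normalize whitespace and trailing punctuation.
    # A real system would condense the query into a focused topic.
    return " ".join(query.split()).rstrip("?.! ")


def research(topic: str) -> Resources:
    # Stub (assumption): returns one placeholder text note and no media.
    return Resources(text=[f"Notes on {topic}"])


def build_slides(topic: str, resources: Resources) -> list:
    # One title slide plus one slide per collected text note.
    return [f"Title: {topic}"] + [f"Slide: {note}" for note in resources.text]


def write_script(slides: list, mode: Mode) -> list:
    # Stub (assumption): tags each narration line with the mode only.
    # How each mode actually shapes its script is not specified here.
    return [f"[{mode.value}] narration for: {slide}" for slide in slides]


def generate_presentation(query: str, mode: Mode) -> Presentation:
    # Chains the stages in the order the abstract describes them.
    topic = summarize_query(query)
    resources = research(topic)
    slides = build_slides(topic, resources)
    script = write_script(slides, mode)
    return Presentation(topic=topic, mode=mode, slides=slides, script=script)


if __name__ == "__main__":
    result = generate_presentation("  What is   diffusion? ", Mode.SINGLE)
    for slide, line in zip(result.slides, result.script):
        print(slide, "|", line)
```

Keeping each stage as a separate function mirrors the staged description in the abstract and makes it easy to swap a stub for a real component without touching the rest of the pipeline.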
