TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Zhenzhi Wang; Jian Wang; Ke Ma; Dahua Lin; Bing Zhou

arXiv:2512.14938 · cs.CV · December 18, 2025

TL;DR

TalkVerse introduces a large, open dataset and a scalable model for minute-long, audio-driven talking video generation, enabling fair comparison and advancing research with lower inference costs.

Contribution

TalkVerse provides a comprehensive, open dataset and a reproducible baseline model for audio-driven talking video generation, facilitating fair benchmarking and reducing computational barriers.

Findings

01

Achieves high-quality, minute-long video generation with low drift.

02

Delivers lip-sync and visual quality comparable to larger models.

03

Supports zero-shot video dubbing with controlled latent noise.

Abstract

We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It…
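
The sliding window mechanism with motion-frame context mentioned in the abstract can be illustrated with a small scheduling sketch. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `plan_windows`, the parameters `window` and `motion_frames`, and the example sizes are all hypothetical, and the abstract does not give the actual window length or context size. The idea shown is only that each window after the first reuses the last few frames of the previous window as context before generating new frames.

```python
from typing import List, Tuple


def plan_windows(
    total_frames: int, window: int, motion_frames: int
) -> List[Tuple[int, int, int]]:
    """Plan overlapping generation windows over a long frame sequence.

    Hypothetical helper: returns (context_start, new_start, end) per window.
    Frames in [context_start, new_start) are reused from the previous window
    as motion context; frames in [new_start, end) are newly generated.
    """
    if not 0 <= motion_frames < window:
        raise ValueError("motion_frames must satisfy 0 <= motion_frames < window")
    if total_frames <= 0:
        return []

    # The first window has no preceding frames, so it carries no context.
    end = min(window, total_frames)
    plans = [(0, 0, end)]
    new_start = end

    while new_start < total_frames:
        # Reach back motion_frames frames into what was already generated.
        context_start = new_start - motion_frames
        end = min(context_start + window, total_frames)
        plans.append((context_start, new_start, end))
        new_start = end

    return plans


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper.
    for context_start, new_start, end in plan_windows(10, 4, 1):
        print(f"context {context_start}:{new_start}, new {new_start}:{end}")
```

In this sketch every window after the first contributes `window - motion_frames` new frames, so a larger context trades generation throughput for more continuity between adjacent windows.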

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization