TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
Zhenzhi Wang, Jian Wang, Ke Ma, Dahua Lin, Bing Zhou

TL;DR
TalkVerse introduces a large, open dataset and a reproducible 5B DiT baseline for minute-long, audio-driven talking video generation, enabling fair comparison across methods at lower inference cost.
Contribution
It provides a comprehensive, open dataset and a reproducible baseline model for audio-driven talking video generation, facilitating fair benchmarking and reducing computational barriers.
Findings
Achieves high-quality, minute-long video generation with low drift.
Delivers lip-sync and visual quality comparable to larger models.
Supports zero-shot video dubbing with controlled latent noise.
Abstract
We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It…
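The abstract mentions a sliding window mechanism with motion-frame context but gives no implementation details here. The following is a minimal sketch of how such a schedule could be planned, not the authors' method: the function name `plan_windows`, the parameters `window` and `motion_frames`, and the rule that each later window reuses the last `motion_frames` generated frames as context are all assumptions made for illustration.

```python
from typing import List, Tuple


def plan_windows(
    total_frames: int, window: int, motion_frames: int
) -> List[Tuple[int, int, int]]:
    """Plan a sliding-window schedule for long video generation.

    Hypothetical helper, not taken from the paper. Each entry is
    (context_start, gen_start, gen_end) with gen_end exclusive:
      - frames [context_start, gen_start) are previously generated
        frames fed back in as motion context (empty for the first window);
      - frames [gen_start, gen_end) are newly generated in this window.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= motion_frames < window:
        raise ValueError("motion_frames must satisfy 0 <= motion_frames < window")

    plans: List[Tuple[int, int, int]] = []
    gen_start = 0
    while gen_start < total_frames:
        if gen_start == 0:
            # First window: no earlier frames exist, so the whole window is new.
            context_start = 0
            new_frames = window
        else:
            # Later windows: part of the window is occupied by context frames.
            context_start = gen_start - motion_frames
            new_frames = window - motion_frames
        gen_end = min(gen_start + new_frames, total_frames)
        plans.append((context_start, gen_start, gen_end))
        gen_start = gen_end
    return plans


if __name__ == "__main__":
    # Toy numbers chosen for readability, not values from the paper.
    for context_start, gen_start, gen_end in plan_windows(10, 4, 1):
        print(f"context [{context_start}, {gen_start}) -> generate [{gen_start}, {gen_end})")
```

Under these assumptions, every frame is generated exactly once, and each window after the first is conditioned on the tail of the previous one, which is one plausible way to limit drift across window boundaries.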
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization
