Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation
Haojie Zhang, Zhihao Liang, Ruibo Fu, Bingyan Liu, Zhengqi Wen, Xuefei Liu, Jianhua Tao, and Yaling Liang

TL;DR
LetsTalk is a diffusion transformer framework with a memory bank for scalable, high-quality, long-duration talking video generation, addressing issues like visual degradation and temporal artifacts.
Contribution
LetsTalk introduces a noise-regularized memory bank and a spatiotemporal-aware transformer, together with novel fusion schemes, to improve the quality and efficiency of long-video synthesis.
Findings
Achieves state-of-the-art quality in long talking video synthesis.
Produces temporally coherent and diverse videos with fewer parameters.
Outperforms previous methods in realism and efficiency.
Abstract
Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait consistency, temporal coherence, and computational efficiency. As video length increases, issues such as visual degradation, portrait drift, temporal artifacts, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal modeling,…
