MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan

TL;DR
MEMO introduces a memory-guided, emotion-aware diffusion approach for generating realistic, expressive talking videos with improved synchronization, identity consistency, and natural expressions, addressing key challenges in audio-driven video synthesis.
Contribution
The paper presents a novel end-to-end framework combining memory-guided temporal modeling and emotion-aware audio processing for enhanced talking video generation.
Findings
Outperforms state-of-the-art methods in video quality and audio-lip synchronization
Enhances long-term identity consistency and expression realism
Produces diverse, natural, and emotion-aligned talking videos
Abstract
Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with…
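The abstract's first module, memory states read through linear attention, can be illustrated with a minimal sketch. This is generic kernelized linear attention, not the authors' implementation: the class name `LinearAttentionMemory`, the `elu(x) + 1` feature map, and the chunked update scheme are all assumptions made for illustration. The point it shows is that the past context is summarized in a fixed-size state, so the cost of attending to earlier frames does not grow with the length of the history.

```python
import numpy as np


def feature_map(x):
    # Assumed positive feature map, elu(x) + 1; the paper may use a different one.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


class LinearAttentionMemory:
    """Hypothetical fixed-size memory read via linear attention."""

    def __init__(self, dim_k, dim_v):
        # S accumulates sum_i phi(k_i) v_i^T; z accumulates sum_i phi(k_i).
        self.S = np.zeros((dim_k, dim_v))
        self.z = np.zeros(dim_k)

    def update(self, keys, values):
        # Fold a new chunk of past frames into the memory state.
        phi_k = feature_map(keys)          # (n, dim_k)
        self.S += phi_k.T @ values         # (dim_k, dim_v)
        self.z += phi_k.sum(axis=0)        # (dim_k,)

    def read(self, queries, eps=1e-6):
        # Attend from current-frame queries to everything stored so far.
        phi_q = feature_map(queries)       # (m, dim_k)
        numerator = phi_q @ self.S         # (m, dim_v)
        denominator = phi_q @ self.z + eps # (m,)
        return numerator / denominator[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = LinearAttentionMemory(dim_k=4, dim_v=3)
    for _ in range(3):  # three chunks of past context, five tokens each
        memory.update(rng.normal(size=(5, 4)), rng.normal(size=(5, 3)))
    out = memory.read(rng.normal(size=(2, 4)))
    print(out.shape)
```

Because the state is a running sum, updating chunk by chunk gives the same result as processing the whole history at once, while the stored state stays at `dim_k × dim_v` regardless of how many frames have been seen.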
Taxonomy
Topics: Human Motion and Animation · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need · Diffusion
