MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Longtao Zheng; Yifan Zhang; Hanzhong Guo; Jiachun Pan; Zhenxiong Tan; Jiahao Lu; Chuanxin Tang; Bo An; Shuicheng Yan

arXiv:2412.04448·cs.CV·December 6, 2024

Open Access · 2 Models · 1 Dataset

TL;DR

MEMO introduces a memory-guided, emotion-aware diffusion approach for generating realistic, expressive talking videos with improved synchronization, identity consistency, and natural expressions, addressing key challenges in audio-driven video synthesis.

Contribution

The paper presents a novel end-to-end framework combining memory-guided temporal modeling and emotion-aware audio processing for enhanced talking video generation.

Findings

01 Outperforms state-of-the-art methods in overall quality and audio-lip synchronization

02 Enhances long-term identity consistency and expression realism

03 Produces diverse, natural, and emotion-aligned talking videos

Abstract

Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with…
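The abstract describes memory states that store information from a longer past context and guide temporal modeling via linear attention. The sketch below illustrates that general idea only: it is a generic kernelized linear attention with a running memory, not the authors' module. The function names, the tensor shapes, and the `elu(x) + 1` feature map are assumptions chosen for illustration.

```python
import numpy as np


def feature_map(x):
    # Positive feature map elu(x) + 1 (an assumed choice, not taken from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention_with_memory(q, k, v, memory=None):
    """Attend over the current chunk plus everything summarized in `memory`.

    q, k: arrays of shape (T, d); v: array of shape (T, d_v).
    memory: optional tuple (S, z) carried over from earlier chunks, where
    S has shape (d, d_v) and z has shape (d,). Hypothetical interface.
    """
    d, d_v = k.shape[1], v.shape[1]
    if memory is None:
        S = np.zeros((d, d_v))
        z = np.zeros(d)
    else:
        S, z = memory

    phi_q, phi_k = feature_map(q), feature_map(k)

    # Fold the current chunk into the fixed-size memory state.
    S = S + phi_k.T @ v          # running sum of phi(k_j) v_j^T
    z = z + phi_k.sum(axis=0)    # running sum of phi(k_j), used as normalizer

    # Each query reads from the memory; cost does not grow with past length.
    out = (phi_q @ S) / (phi_q @ z)[:, None]
    return out, (S, z)
```

Because the past is compressed into `S` and `z`, a later chunk can be processed by passing in the memory returned from the previous call, without keeping earlier keys and values around. How MEMO actually builds and updates its memory states is specified in the paper itself and may differ from this sketch.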

Code & Models

Models

Datasets

memoavatar/memo_data
dataset · 5.8k downloads

Taxonomy

Topics: Human Motion and Animation · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Diffusion