Learning Frame-Wise Emotion Intensity for Audio-Driven Talking-Head Generation

Jingyi Xu, Hieu Le, Zhixin Shu, Yang Wang, Yi-Hsuan Tsai, and Dimitris Samaras

arXiv:2409.19501 · cs.SD · October 1, 2024

TL;DR

This paper introduces a framework for generating emotionally expressive talking-head videos that models how emotion intensity fluctuates continuously during speech, so that subtle dynamic changes are captured instead of being flattened into a static emotional output.

Contribution

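The paper proposes a continuous emotion latent space, in which emotion type is encoded in latent orientation and emotion intensity in latent norm, together with an audio-to-intensity predictor trained with pseudo-labels. Together these enable precise control over intensity levels and realistic modeling of intensity dynamics.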

Findings

01 Effective in capturing emotion intensity fluctuations

02 Enhances realism and expressiveness of generated talking-heads

03 Validated through extensive experiments

Abstract

Human emotional expression is inherently dynamic, complex, and fluid, characterized by smooth transitions in intensity throughout verbal communication. However, the modeling of such intensity fluctuations has been largely overlooked by previous audio-driven talking-head generation methods, which often results in static emotional outputs. In this paper, we explore how emotion intensity fluctuates during speech, proposing a method for capturing and generating these subtle shifts for talking-head generation. Specifically, we develop a talking-head framework that is capable of generating a variety of emotions with precise control over intensity levels. This is achieved by learning a continuous emotion latent space, where emotion types are encoded within latent orientations and emotion intensity is reflected in latent norms. In addition, to capture the dynamic intensity fluctuations, we…
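
The latent-space idea in the abstract (emotion type carried by the orientation of a latent vector, intensity carried by its norm) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 2-D direction vector, and the intensity values are all hypothetical and chosen only to show the orientation/norm decomposition.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def make_latent(direction, intensity):
    """Build a latent whose orientation is `direction` and whose norm is `intensity`."""
    n = norm(direction)
    return [intensity * x / n for x in direction]

def decompose(latent):
    """Split a non-zero latent into (unit orientation, intensity = norm)."""
    n = norm(latent)
    return [x / n for x in latent], n

# Hypothetical orientation standing in for one emotion type.
happy_dir = [3.0, 4.0]

# Hypothetical frame-wise intensities, one per video frame.
intensities = [0.2, 0.4, 0.6, 0.8, 1.0]

# One latent per frame: same orientation, varying norm.
frames = [make_latent(happy_dir, s) for s in intensities]

# Reading the intensity back is just taking the norm of each frame latent.
recovered = [decompose(z)[1] for z in frames]
```

Under this assumed parameterization, changing the intensity over time rescales the latent without changing which emotion it points to, which is the property the abstract attributes to the learned space.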

Taxonomy

Topics: Music and Audio Processing · Speech and dialogue systems · Music Technology and Sound Studies