ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations

Xulong Zhang; Xiaoyang Qu; Haoxiang Shi; Chunguang Xiao; Jianzong Wang

arXiv:2411.13089·cs.CV·November 27, 2024

Open Access

TL;DR

This paper introduces a 3D speech-to-animation framework that uses a reward model and automatic evaluation to generate diverse, emotionally expressive facial animations closely aligned with human preferences.

Contribution

The paper presents a speech-to-animation (STA) model paired with a reward model, together with a training approach that enhances the emotional depth and diversity of the generated animations.

Findings

01. Generated animations are more emotionally expressive.

02. The framework outperforms existing models on quality metrics.

03. Animations better match human preferences.

Abstract

This paper proposes a novel 3D speech-to-animation (STA) generation framework designed to address the shortcomings of existing models in producing diverse and emotionally resonant animations. Current STA models often generate animations that lack emotional depth and variety, failing to align with human expectations. To overcome these limitations, we introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions through a cross-coupling training approach. Additionally, we develop a training methodology that leverages automatic quality evaluation of generated facial animations to guide the reinforcement learning process. This methodology encourages the STA model to explore a broader range of possibilities, resulting in the generation of diverse and emotionally expressive facial animations of superior quality.…
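The title and abstract refer to a reward model learned from automatically ranked demonstrations. The excerpt above does not give the architecture or the ranking procedure, so the following is only a minimal sketch of one common way such a model can be trained: a pairwise (Bradley-Terry style) ranking loss over pairs of clips where one is ranked above the other. The linear reward, the two-element feature vectors, and all function names are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


def reward(weights, features):
    """Hypothetical linear reward: dot product of weights and clip features."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_ranking_loss(weights, better, worse):
    """Bradley-Terry style loss: -log sigmoid(r(better) - r(worse))."""
    margin = reward(weights, better) - reward(weights, worse)
    # log(1 + exp(-margin)), written so exp never sees a large positive input
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def train_reward_model(ranked_pairs, dim, lr=0.1, epochs=200):
    """Fit the reward weights by plain gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for better, worse in ranked_pairs:
            margin = reward(weights, better) - reward(weights, worse)
            # d(loss)/d(margin) = -sigmoid(-margin) = -1 / (1 + exp(margin))
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (better[i] - worse[i])
    return weights


# Assumed toy features per clip: [expressiveness score, lip-sync error].
# Each tuple is (higher-ranked clip, lower-ranked clip).
ranked_pairs = [
    ([0.9, 0.1], [0.4, 0.5]),
    ([0.7, 0.2], [0.3, 0.6]),
    ([0.8, 0.3], [0.5, 0.7]),
]
weights = train_reward_model(ranked_pairs, dim=2)
```

After training on these toy pairs, `reward(weights, clip)` scores each higher-ranked clip above its lower-ranked partner. In a reinforcement learning setup of the kind the abstract describes, a score like this could serve as the reward signal for the STA model, though how the paper actually connects the two is not stated in this excerpt.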

Taxonomy

Topics: Human Motion and Animation · Speech Recognition and Synthesis · Face Recognition and Analysis

Methods: ALIGN