# EmoCAST: Emotional Talking Portrait via Emotive Text Description

**Authors:** Yiguo Jiang, Xiaodong Cun, Yong Zhang, Yudian Zheng, Fan Tang, Chi-Man Pun

arXiv: 2508.20615 · 2025-12-24

## TL;DR

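EmoCAST is a diffusion-based framework that synthesizes emotionally expressive talking-head videos from driving audio, with the emotion controlled by a text description. It combines two attention modules (text-guided emotive attention and emotive audio attention), a large-scale in-the-wild emotional dataset, and dedicated training strategies to improve control, realism, and emotion accuracy.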

## Contribution

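The paper makes three contributions: (1) a diffusion-based, emotion-aware talking-head framework whose text-guided emotive attention and emotive audio attention modules enable text control of emotion; (2) a large-scale, in-the-wild emotional talking-head dataset annotated with emotive text descriptions; and (3) two training strategies, emotion-aware sampling and progressive functional training, that improve expressiveness and lip-sync.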

## Key findings

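- Reports state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
- Captures nuanced expressions through two attention modules: text-guided emotive attention for appearance modeling and emotive audio attention for audio-emotion alignment.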
- Demonstrates superior performance on in-the-wild datasets.

## Abstract

Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are mainly collected in lab settings, further exacerbating these shortcomings and hindering real-world deployment. To address these challenges, we propose EmoCAST, a diffusion-based talking head framework for precise, text-driven emotional synthesis. Its contributions are threefold: (1) architectural modules that enable effective text control; (2) an emotional talking-head dataset that expands the framework's ability; and (3) training strategies that further improve performance. Specifically, for appearance modeling, emotional prompts are integrated through a text-guided emotive attention module, enhancing spatial knowledge to improve emotion understanding. To strengthen audio-emotion alignment, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide precise facial motion synthesis. Additionally, we construct a large-scale, in-the-wild emotional talking head dataset with emotive text descriptions to optimize the framework's performance. Based on this dataset, we propose an emotion-aware sampling strategy and a progressive functional training strategy that improve the model's ability to capture nuanced expressive features and achieve accurate lip-sync. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST
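The abstract describes the emotive audio attention module only at a high level, as capturing "the interplay between controlled emotion and driving audio". One generic way to picture such an interplay is scaled dot-product cross-attention, with audio frames as queries and emotion text tokens as keys and values. The sketch below is an illustration under that assumption, not the authors' implementation: the function names, the toy feature values, and the absence of learned projection matrices are all hypothetical simplifications.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention on plain Python lists.

    Each query attends over all keys; the output row is the
    attention-weighted average of the value vectors.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy features: 3 audio frames and 2 emotion text tokens,
# each a 2-dimensional vector. Real features would come from learned encoders.
audio_frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
emotion_tokens = [[1.0, 0.0], [0.0, 1.0]]

# Assumption: audio frames query the emotion tokens (keys and values).
fused = cross_attention(audio_frames, emotion_tokens, emotion_tokens)
```

Here `fused` holds one emotion-conditioned vector per audio frame. A practical module would add learned query, key, and value projections, multiple heads, and a residual connection; the paper's full text should be consulted for the actual design.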

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20615/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20615/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/2508.20615/full.md

---
Source: https://tomesphere.com/paper/2508.20615