EmoCAST: Emotional Talking Portrait via Emotive Text Description

Yiguo Jiang; Xiaodong Cun; Yong Zhang; Yudian Zheng; Fan Tang; Chi-Man Pun

arXiv:2508.20615 · cs.CV · December 24, 2025

TL;DR

EmoCAST is a diffusion-based framework that synthesizes emotionally expressive talking head videos from text, integrating novel modules and a large in-the-wild dataset to enhance control, realism, and emotion accuracy.

Contribution

The paper introduces a new emotion-aware talking head synthesis framework with effective text control modules and a large-scale emotional dataset for improved realism and expressiveness.

Findings

1. Achieves state-of-the-art results in emotional expression and lip-sync accuracy.

2. Effectively models nuanced emotions through novel attention modules.

3. Demonstrates superior performance on in-the-wild datasets.

Abstract

Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are mainly collected in lab settings, further exacerbating these shortcomings and hindering real-world deployment. To address these challenges, we propose EmoCAST, a diffusion-based talking head framework for precise, text-driven emotional synthesis. Its contributions are threefold: (1) architectural modules that enable effective text control; (2) an emotional talking-head dataset that expands the framework's ability; and (3) training strategies that further improve performance. Specifically, for appearance modeling, emotional prompts are integrated through a text-guided emotive attention module, enhancing spatial knowledge to improve emotion…

Tables (4)

Table 1: Comparison between ETTH and relevant datasets.

Dataset	IDs	Hours	Emo Label	Emo Level	Text Description
CelebV [39]	5	2	$\times$	$\times$	$\times$
VoxCeleb [22]	1k+	352	$\times$	$\times$	$\times$
VoxCeleb2 [5]	6k+	2442	$\times$	$\times$	$\times$
Hallo3 [7]	N/A	70	$\times$	$\times$	$\times$
CelebV-HQ [47]	15k+	68	$\checkmark$	$\times$	$\times$
MEAD [34]	60	39	$\checkmark$	3	$\times$
EmoTalk3D [11]	30	15	$\checkmark$	2	$\times$
ETTH (Ours)	15k+	158	$\checkmark$	Fine-grained	$\checkmark$

Table 2: Quantitative comparisons with state-of-the-art methods on the MEAD [34] and out-of-domain test sets. We mainly compare our method with diffusion-based methods, and the metrics of GAN-based methods are listed for reference. Best diffusion-based results are highlighted in bold.

Method	Backbone	Emotional Condition	MEAD $Acc_{emo}\uparrow$	MEAD LSE-D $\downarrow$	MEAD LSE-C $\uparrow$	MEAD FID $\downarrow$	In-the-Wild $Acc_{emo}\uparrow$	In-the-Wild LSE-D $\downarrow$	In-the-Wild LSE-C $\uparrow$
MakeItTalk [46]	GAN	N/A	12.50%	9.78	5.25	73.92	12.86%	9.95	4.44
SadTalker [43]	GAN	N/A	12.50%	7.49	7.60	62.79	12.50%	7.25	7.18
EAMM [16]	GAN	Video	13.28%	11.11	3.96	76.70	21.79%	9.94	4.16
PD-FGC [33]	GAN	Video	43.75%	8.78	6.01	62.46	40.57%	9.18	5.19
EDTalk [31]	GAN	Video	29.69%	7.17	8.06	59.60	33.57%	7.77	7.00
EAT [8]	GAN	Label	59.77%	7.69	7.91	58.21	32.50%	8.38	6.50
Aniportrait [37]	Diffusion	N/A	12.50%	9.58	4.93	49.46	13.93%	10.35	3.72
Echomimic [2]	Diffusion	N/A	12.50%	8.93	6.02	45.41	14.64%	9.13	5.49
Hallo [41]	Diffusion	N/A	12.50%	8.55	6.43	47.99	12.50%	8.34	6.23
Hallo2 [6]	Diffusion	N/A	12.50%	8.48	6.52	44.62	12.50%	8.39	6.19
Ours	Diffusion	Text Prompt	83.60%	8.67	6.79	35.89	56.43%	8.12	6.94

Table 3: User study on the in-the-wild test set.

Metric	SadTalker	Hallo2	EAMM	PD-FGC	EAT	Ours
Audio-visual Sync	3.13	3.30	1.29	2.31	3.11	3.68
Video Quality	3.23	3.63	1.20	1.34	2.66	3.83
Emotion Quality	1.49	1.94	1.24	2.48	2.35	3.75

Table 4: Quantitative results of the ablation study on MEAD.

Variant	LSE-D $\downarrow$	LSE-C $\uparrow$	$Acc_{emo}\uparrow$
w/o text emotive attention	8.91	6.55	44.92%
w/o emotive audio attention	9.36	5.80	61.72%
w/o emotion-aware sampling	8.82	6.57	21.09%
w/o progressive training	9.99	5.45	51.56%
Ours	8.67	6.79	83.60%

Equations

$$L=\mathbb{E}_{z_{t},\epsilon,c,t}\left[\|\epsilon-\epsilon_{\theta}(z_{t},c,t)\|^{2}\right],$$

$$CA_{face}(Q(z_{t}),K(e_{f}),V(e_{f}))+CA_{text}(Q(z_{t}),K(e_{t}),V(e_{t}))=\mathrm{Softmax}\left(\frac{Q_{z}K_{f}^{T}}{d}\right)V_{f}+\mathrm{Softmax}\left(\frac{Q_{z}K_{t}^{T}}{d}\right)V_{t},$$

$$f_{ea}=CA(Q(e_{t}),K(e_{a}),V(e_{a})).$$

$$f_{lip}=CA(Q(f_{v}),K(f_{ea}),V(f_{ea}))\odot M_{lip},$$

$$f_{exp}=CA(Q(f_{v}),K(f_{ea}),V(f_{ea}))\odot M_{exp},$$

$$f_{pose}=CA(Q(f_{v}),K(f_{ea}),V(f_{ea}))\odot M_{pose},$$

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining

Full text

EmoCAST: Emotional Talking Portrait via Emotive Text Description

Yiguo Jiang¹, Xiaodong Cun²*, Yong Zhang³, Yudian Zheng¹, Fan Tang⁴, Chi-Man Pun¹*

¹University of Macau; ²GVC Lab, Great Bay University; ³Meituan; ⁴ICT-CAS

Project Page: https://github.com/GVCLab/EmoCAST

Abstract

Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are mainly collected in lab settings, further exacerbating these shortcomings and hindering real-world deployment. To address these challenges, we propose EmoCAST, a diffusion-based talking head framework for precise, text-driven emotional synthesis. Its contributions are threefold: (1) architectural modules that enable effective text control; (2) an emotional talking-head dataset that expands the framework’s ability; and (3) training strategies that further improve performance. Specifically, for appearance modeling, emotional prompts are integrated through a text-guided emotive attention module, enhancing spatial knowledge to improve emotion understanding. To strengthen audio-emotion alignment, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide precise facial motion synthesis. Additionally, we construct a large-scale, in-the-wild emotional talking head dataset with emotive text descriptions to optimize the framework’s performance. Based on this dataset, we propose an emotion-aware sampling strategy and a progressive functional training strategy that improve the model’s ability to capture nuanced expressive features and achieve accurate lip-sync. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.

1 Introduction

Generating vivid talking avatars has garnered significant attention in recent years. This technology offers diverse applications across multiple fields, including video content creation, animation production, digital humans, virtual reality, and human-machine interaction [1, 46, 28, 32]. Previous works [23, 3, 9, 43, 30, 41] have primarily concentrated on audio-lip synchronization in generated talking-head videos, ignoring the accompanying emotions, which are essential for natural human communication.

Some recent methods [34, 16, 17, 33, 8, 45, 19, 36, 18, 31] have shifted their focus to emotion control in talking portrait generation, aiming to produce expressive and emotionally rich talking heads. However, directly inferring expressions from speech remains challenging [20]. For example, talking head videos generated using only emotionally cued audio often fail to exhibit distinct facial expressions [43, 2, 41]. Consequently, additional emotion-control signals are typically required. Early approaches utilize emotion labels to regulate expression categories in the generated talking videos [34, 15, 8, 40], while others extract expression information directly from an emotional video template [16, 17, 33, 18]. Nonetheless, these methods are limited in flexibility and controllability by their reliance on a label or a reference video. Moreover, because emotional videos are hard to capture, existing emotional talking head datasets remain limited to laboratory environments, with restricted sample sizes and identities.

To address these challenges, we propose a novel diffusion-based framework for emotional talking head generation that leverages natural language for emotion control, thereby enhancing applicability to real-world scenarios. We advance this goal along three axes: (i) design two modules that effectively integrate text control; (ii) construct an in-the-wild talking-head dataset with rich emotion annotations to facilitate accurate emotion modeling; and (iii) propose two training strategies to further optimize the framework. Specifically, to achieve precise text-controlled emotional synthesis, our framework incorporates two key components: a text-guided emotive attention module and an emotive audio attention module. First, we design the text-guided emotive attention module to learn accurate alignment between emotional facial features and corresponding textual prompts in appearance modeling via a decoupled cross-attention mechanism. Beyond investigating the interaction between textual emotional features and facial features, the relationship between emotional features and audio signals requires systematic exploration. Accordingly, the emotive audio attention module aligns emotional information across textual emotion and audio modalities, modeling their correspondence for the facial region.

Furthermore, we construct a large-scale, in-the-wild Emotive Text-to-Talking Head (ETTH) dataset comprising 158 hours of emotional talking-head videos and spanning diverse identities. For each video, we provide accurate abstract emotion labels, fine-grained emotion intensity levels, and rich emotive textual descriptions. Moreover, based on our dataset, we propose two training strategies. First, during expression learning training, instead of using the reference image from the same emotional video, we use a neutral-expression image of the same identity. This method significantly enhances the model’s ability to capture subtle emotional nuances. Second, we propose a progressive functional training strategy that jointly leverages neutral and emotional talking-head datasets, progressively improving the model’s generalization capacity, expression accuracy, and lip-synchronization in a coarse-to-fine manner.

To evaluate the effectiveness of our proposed method, we conduct comprehensive evaluations on both the MEAD test set and an in-the-wild test set. The experimental results demonstrate that the proposed method achieves state-of-the-art performance in generating realistic, emotionally expressive talking-head videos. On the MEAD test set, our method attains an emotion accuracy of 83.60%, substantially exceeding competing approaches. More importantly, on the out-of-domain, in-the-wild test set, both emotion accuracy and lip-sync quality surpass those of other methods, indicating strong generalization.

Overall, our main contributions are summarized as:

• We present EmoCAST, a novel framework for emotional talking portrait generation that integrates user-friendly emotional text prompts to produce lifelike expressions.

• To enable precise text-driven emotion control, we design two specific modules: a text-guided emotive attention module that aligns facial dynamics with textual prompts while preserving identity, and an emotive audio attention module to model the relationship between controlled emotion and driving speech.

• We present a large-scale, in-the-wild emotional talking-head dataset with rich annotations, including discrete emotion categories, fine-grained emotion intensity levels, and textual emotion descriptions. We further propose two training strategies, namely emotion-aware sampling and progressive functional training.

• Extensive experiments demonstrate that our method generates natural, emotionally expressive talking portraits that remain synchronized with the driving audio.

2 Related Work

Audio-Driven Talking Portrait Generation. Audio-driven talking portrait generation aims to create realistic talking head videos synchronized with corresponding speech. Recently, some deep learning-based methods [46, 43, 17, 37, 2, 41, 6] have significantly advanced this domain. MakeItTalk [46] predicts facial landmarks using disentangled audio content and speaker information. SadTalker [43] focuses on separately learning the expression and pose coefficients of a 3D Morphable Model. EchoMimic [2] utilizes audio input and facial landmarks to synthesize high-quality talking head videos. Hallo [41] employs a hierarchical audio-driven visual synthesis module to improve the precision of audio-visual alignment. Hallo2 [6] achieves long-duration, high-resolution portrait image animation. The majority of these methods concentrate on generating synchronized mouth movements, neglecting the crucial aspect of emotional control.

Emotional Audio-Driven Talking Portrait. Emotion significantly enhances the vividness and expressiveness of facial animation, thereby profoundly influencing the realism of generated talking portraits. Recently, some audio-driven talking head methods have incorporated emotion control to produce more expressive and realistic talking portraits [34, 16, 33, 8, 18, 19, 36, 11, 31]. EAMM [16] extracts dynamic emotion patterns from a driving video and applies these transferable patterns to generate emotion-consistent talking heads. PD-FGC [33] employs disentangled latent representations to capture facial motion and subsequently feeds these latents into an image generator to synthesize talking heads. EAT [8] achieves emotion control through parameter-efficient adaptation of a pretrained emotion-agnostic talking head model. EDTalk [31] achieves effective emotion control by modeling expressions, mouth movements, and poses within three disentangled latent spaces. TalkCLIP [19] and InstructAvatar [36] rely on text-based control; however, generating accurate and vivid emotional expressions for in-the-wild reference images through textual control remains challenging.

3 Methodology

As shown in Fig. 2, given a single reference image as the appearance, the driving audio as talking content, and the text prompt for emotion modeling, the proposed EmoCAST generates expressive talking head videos with described emotion. Below, we first introduce the basic knowledge of the diffusion model in Sec. 3.1, which establishes the foundational framework for our method. Sec. 3.2 presents the EmoCAST pipeline with detailed explanations of its components. We then introduce our newly constructed Emotive Text-to-Talking Head (ETTH) dataset in Sec. 3.3. Finally, Sec. 3.4 details the two proposed training strategies.

3.1 Preliminaries: Latent Diffusion Model

Diffusion Models [13, 29], especially the Latent Diffusion Model (LDM) [25], produce data samples from Gaussian noise through iterative denoising steps. These models consist of two distinct phases: forward diffusion and backward denoising. During the forward diffusion process, Gaussian noise is progressively added to the original data. Conversely, the backward denoising process seeks to reconstruct the original data by reversing the noise addition procedure. We leverage the LDM for the talking head video generation task. Specifically, LDM utilizes the encoder $E$ of the pre-trained Variational Autoencoder (VAE) to convert the input image $x$ to the latent space, generating the initial latent feature $z_{0}=E(x)$. Subsequently, Gaussian noise $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ is gradually added to the latent feature $z_{0}$ over $t$ time steps, progressively diffusing towards the distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$. This diffusion process can be represented as: $q(z_{t}|z_{t-1})=\mathcal{N}(z_{t};\sqrt{1-\beta_{t}}z_{t-1},\beta_{t}\mathbf{I}),$ where $\beta_{t}$ is a variance schedule. The $z_{t}$ at an arbitrary timestep $t$ of the diffusion process can be expressed as: $q(z_{t}|z_{0})=\mathcal{N}(z_{t};\sqrt{\bar{\alpha}_{t}}z_{0},(1-\bar{\alpha}_{t})\mathbf{I}),$ where $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$. Thus, $z_{t}$ can be derived from $z_{0}$, expressible as a linear combination of $z_{0}$ and the noise $\epsilon$ by $z_{t}=\sqrt{\bar{\alpha}_{t}}z_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon.$
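As a rough illustration of the closed-form forward step, the following sketch noises a toy latent vector in plain Python. The linear variance schedule, the number of steps, the toy latent, and the function names `make_alpha_bars` and `q_sample` are assumptions made for illustration; the text does not specify the schedule used by EmoCAST.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for an ASSUMED linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta        # alpha_t = 1 - beta_t
        alpha_bars.append(running)   # alpha_bar_t = product of alpha_s for s <= t
    return alpha_bars

def q_sample(z0, t, alpha_bars, rng):
    """Draw z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps.

    `t` is a 0-based index into `alpha_bars`.
    """
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
          for z, e in zip(z0, eps)]
    return zt, eps

alpha_bars = make_alpha_bars()
rng = random.Random(0)
z0 = [1.0, -0.5, 0.25, 2.0]          # toy stand-in for a VAE latent
zt, eps = q_sample(z0, t=500, alpha_bars=alpha_bars, rng=rng)
```

Because $\bar{\alpha}_{t}$ shrinks monotonically under such a schedule, later timesteps carry less of $z_{0}$ and more noise.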

During denoising, the UNet [26] is trained to predict the added noise $\epsilon$ in the forward diffusion process. Consequently, the target latent $\hat{z_{0}}$ can be iteratively denoised from $z_{t}$ . The objective function for training can be expressed as:

$$L=\mathbb{E}_{z_{t},\epsilon,c,t}\left[\|\epsilon-\epsilon_{\theta}(z_{t},c,t)\|^{2}\right],$$

where $\epsilon_{\theta}$ is the noise predicted by the UNet and $c$ is the condition set. After obtaining the target latent $\hat{z_{0}}$, the reconstructed output image $\hat{x}$ can be generated by a VAE decoder, $\hat{x}=D(\hat{z_{0}})$. In our talking head animation task, we feed several latent features to the denoising network jointly for video modeling.
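The noise-prediction objective (the expected squared error between the added noise and the network's prediction) can be sketched for a single sample as follows. This is only a toy under stated assumptions: `eps_theta` is a hypothetical stand-in for the denoising UNet, `a_bar_t` is an arbitrary value of $\bar{\alpha}_{t}$, and the two example predictors exist purely to show the range of the loss.

```python
import math
import random

def noise_prediction_loss(z0, a_bar_t, t, cond, eps_theta, rng):
    """Single-sample estimate of ||eps - eps_theta(z_t, c, t)||^2."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a_bar_t) * z + math.sqrt(1.0 - a_bar_t) * e
          for z, e in zip(z0, eps)]
    pred = eps_theta(zt, cond, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

z0 = [1.0, -0.5, 0.25, 2.0]   # toy latent
a_bar_t = 0.3                 # assumed value of alpha_bar at step t

def zero_predictor(zt, cond, t):
    """Untrained stand-in: always predicts zero noise."""
    return [0.0 for _ in zt]

def oracle_predictor(zt, cond, t):
    """Cheating stand-in: recovers eps exactly because it knows z0."""
    return [(x - math.sqrt(a_bar_t) * z) / math.sqrt(1.0 - a_bar_t)
            for x, z in zip(zt, z0)]

# The same seed gives both predictors the same noise draw.
loss_zero = noise_prediction_loss(z0, a_bar_t, 10, None, zero_predictor, random.Random(0))
loss_oracle = noise_prediction_loss(z0, a_bar_t, 10, None, oracle_predictor, random.Random(0))
```

Training pushes a real network from the `zero_predictor` regime toward the `oracle_predictor` regime, averaged over latents, noise draws, conditions, and timesteps.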

3.2 Network Structure of EmoCAST

As shown in Fig. 2, our model primarily comprises a ReferenceNet and a Denoising UNet following the pre-trained Stable Diffusion [25], inspired by prior human animation methods [14, 2, 41]. The ReferenceNet extracts the visual appearance of the reference image and injects these features into the Denoising UNet to guide frame generation. The Denoising UNet progressively denoises noisy latents to produce emotional frames that maintain visual coherence with the reference image. Since our task is video generation, temporal modules based on frame-wise temporal attention [10] are used to maintain temporal consistency. In addition, audio is injected into the base model via cross-attention as motion control. Based on this network structure, we aim to generate an emotional talking portrait via an additional control text prompt. Thus, we propose a text-guided emotive attention module, which utilizes a decoupled cross-attention mechanism to feed the emotional textual feature into the diffusion model (Sec. 3.2.1). Furthermore, we develop an emotive audio attention module to capture the relationship between emotive text and audio, thereby generating emotion-aware audio features to drive the synthesis of precise facial expression motions (Sec. 3.2.2).

3.2.1 Text-guided Emotive Attention Module.

As illustrated in Fig. 2, this module is designed to integrate face embeddings $e_{f}$ and text embeddings $e_{t}$ into the diffusion model. A straightforward approach is to concatenate textual embeddings $e_{t}$ and facial embeddings $e_{f}$ , integrating them into the model through a shared cross-attention module. However, this method fails to effectively disentangle facial features from text-controlled attributes, often causing both the deterioration of identity-preserving visual features and insufficient learning of facial expressions from the control text. To address this, we employ a decoupled cross-attention mechanism [42], which more effectively captures expression features while preserving identity-related visual information. Specifically, we first employ a pre-trained face encoder to extract facial embeddings $e_{f}$ for identity representation and utilize CLIP [24] to obtain textual embeddings $e_{t}$ for emotion control. Then, we utilize a decoupled cross-attention mechanism with two parallel branches: (1) Facial cross-attention $CA_{face}$ processes interactions between facial embeddings $e_{f}$ and noisy latent $z_{t}$ . (2) Textual cross-attention $CA_{text}$ mediates interaction between textual embeddings $e_{t}$ and noisy latent $z_{t}$ . The final output combines both attention branches via addition:

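$$CA_{face}(Q(z_{t}),K(e_{f}),V(e_{f}))+CA_{text}(Q(z_{t}),K(e_{t}),V(e_{t}))=\mathrm{Softmax}\left(\frac{Q_{z}K_{f}^{T}}{d}\right)V_{f}+\mathrm{Softmax}\left(\frac{Q_{z}K_{t}^{T}}{d}\right)V_{t},$$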

where $Q_{z}=W_{Q}z_{t}$, $K_{f}=W_{K}^{f}e_{f}$, $V_{f}=W_{V}^{f}e_{f}$, $K_{t}=W_{K}^{t}e_{t}$, $V_{t}=W_{V}^{t}e_{t}$, and $W_{Q}$, $W_{K}^{f}$, $W_{V}^{f}$, $W_{K}^{t}$, $W_{V}^{t}$ are learnable projection matrices. This design ensures that the generated facial features remain consistent with the reference image while simultaneously synthesizing vivid emotions that align with the provided emotional prompts.
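A minimal sketch of the two-branch (decoupled) cross-attention may help make the additive combination concrete. It is not the authors' implementation: the learnable projections are omitted (the inputs are treated as already projected), the scores are divided by the feature dimension `d` as the formula is written here (many implementations use the square root of `d` instead), and all names and toy values are assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q, K, V, scale):
    """Softmax(Q K^T / scale) V for list-of-lists matrices."""
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

def decoupled_cross_attention(Qz, Kf, Vf, Kt, Vt, d):
    """Face branch and text branch share the query and are summed."""
    face = cross_attention(Qz, Kf, Vf, d)   # identity-preserving branch
    text = cross_attention(Qz, Kt, Vt, d)   # emotion-control branch
    return [[a + b for a, b in zip(rf, rt)] for rf, rt in zip(face, text)]

# Toy inputs: 2 latent tokens, 1 face token, 3 text tokens, d = 2.
Qz = [[1.0, 0.0], [0.0, 1.0]]
Kf, Vf = [[0.5, 0.5]], [[1.0, 2.0]]
Kt = [[0.2, 0.2], [0.2, 0.2], [0.2, 0.2]]   # identical keys -> uniform weights
Vt = [[0.0, 3.0], [3.0, 0.0], [3.0, 3.0]]
out = decoupled_cross_attention(Qz, Kf, Vf, Kt, Vt, d=2)
```

Since each branch has its own keys and values, the text tokens never compete with the face tokens inside a single softmax, which is the property that distinguishes this design from concatenating both embeddings into one shared attention.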

3.2.2 Emotive Audio Attention Module.

To generate dynamic expression motions that are more consistent with emotional audio, we propose an emotive audio attention module. This module first aligns audio features with textual emotion features to derive emotion-aware audio features, which are then used to interact with facial features, thereby guiding the generation of realistic dynamic facial expressions. In detail, we first extract audio embeddings using a pretrained wav2vec [27]. For textual embeddings, we employ CLIP to provide emotional control information. Next, as shown in Fig. 2, these extracted embeddings, along with the visual latent representation, are jointly fed into the emotive audio attention module. To establish the relationship between textual expression features and audio features, the emotional text embedding $e_{t}$ undergoes a cross-attention operation with the audio embedding $e_{a}$ to obtain the emotion-aware audio feature $f_{ea}$. The calculation is as follows:

$$f_{ea}=CA(Q(e_{t}),K(e_{a}),V(e_{a})).$$

Subsequently, the emotion-aware audio feature $f_{ea}$ and the visual latent feature $f_{v}$ are integrated through cross-attention to capture the relationships between audio and visual components. Following Hallo [41], we implement three distinct cross-attention blocks for lips, expressions, and poses, respectively to extract corresponding features. The process is as follows:

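$$f_{lip}=CA(Q(f_{v}),K(f_{ea}),V(f_{ea}))\odot M_{lip},$$

$$f_{exp}=CA(Q(f_{v}),K(f_{ea}),V(f_{ea}))\odot M_{exp},$$

$$f_{pose}=CA(Q(f_{v}),K(f_{ea}),V(f_{ea}))\odot M_{pose},$$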

where $\odot$ is the Hadamard product. $M_{lip}$ , $M_{exp}$ , and $M_{pose}$ denote masks for the lip, expression, and pose regions, respectively. Finally, these features are combined using a convolutional layer and input to the subsequent module.
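The two-step flow of this module (text queries attend over audio, then visual queries attend over the result, followed by region masking) can be sketched as below. This is an illustrative toy, not the paper's code: learnable projections are omitted, so the three region branches share one attention result here, whereas the paper uses three distinct cross-attention blocks; the masks are per-token binary values; the final convolutional fusion is left out; and every name and value is an assumption.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q, K, V):
    """Softmax(Q K^T / d) V, with d taken as the query feature size."""
    d = len(Q[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / d for k in K])
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

def apply_mask(features, mask):
    """Hadamard product with a per-token mask broadcast over channels."""
    return [[m * x for x in row] for row, m in zip(features, mask)]

def emotive_audio_attention(e_t, e_a, f_v, masks):
    f_ea = cross_attention(e_t, e_a, e_a)        # emotion-aware audio feature
    attended = cross_attention(f_v, f_ea, f_ea)  # visual queries over f_ea
    return {name: apply_mask(attended, m) for name, m in masks.items()}

e_t = [[1.0, 0.0], [0.0, 1.0]]                           # 2 text tokens
e_a = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7]]               # 3 audio frames
f_v = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]   # 4 visual tokens
masks = {"lip": [0, 0, 1, 1], "exp": [1, 1, 0, 0], "pose": [1, 1, 1, 1]}
out = emotive_audio_attention(e_t, e_a, f_v, masks)
```

Tokens outside a region's mask contribute zeros to that branch, so each of `out["lip"]`, `out["exp"]`, and `out["pose"]` only carries audio-driven signal for its own facial area.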

3.3 Emotive Text-to-Talking Head Dataset

Emotional talking head datasets are significantly smaller in scale than the extensive datasets of neutral talking head videos, as shown in Tab. 1. Furthermore, enabling fine-grained expression control via natural language necessitates datasets with detailed textual descriptions of emotional styles. To bridge these gaps, we introduce an Emotive Text-to-Talking Head (ETTH) dataset featuring both accurate expression labels and rich emotive textual descriptions. To build it, we annotate the following datasets with emotion information: MEAD [34], HDTF [44], CelebV-HQ [47], and Hallo3 [7].

In detail, we process the collected videos in three steps to meet our task requirements, including: lip synchronization filtering, emotion label annotation, and the generation of emotive text descriptions. For lip-sync, we use SyncNet [4] to obtain the Syn-C and Syn-D scores. This enables us to flexibly filter videos based on these metrics to meet diverse data requirements. Regarding emotion labels, we directly utilize the dataset-provided labels for the lab-collected MEAD videos. In the case of Hallo3 and CelebV-HQ, we employ Emotion-FAN [21] that is fine-tuned on MEAD to generate abstract emotion labels and associated intensity values. To generate emotional text prompts, we refer to MMHead [38] by providing ChatGPT with the video’s abstract emotion label, enabling it to generate textual scene descriptions that evoke the target emotion. The statistics of our ETTH dataset are detailed in Table 1. Our dataset encompasses a diverse range of speaker identities and includes comprehensive facial expression annotations. More details of the ETTH dataset are provided in the supplement.
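The lip-sync filtering step can be pictured as a simple threshold on the two SyncNet scores. The sketch below is hypothetical: the `Clip` record, the file names, the score values, and both thresholds are made up for illustration, and the text does not state which cut-offs were used for ETTH.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    sync_conf: float  # SyncNet confidence score (higher is better)
    sync_dist: float  # SyncNet distance score (lower is better)

def filter_by_lip_sync(clips, min_conf, max_dist):
    """Keep clips whose confidence is high enough and distance low enough."""
    return [c for c in clips
            if c.sync_conf >= min_conf and c.sync_dist <= max_dist]

# Hypothetical clips and thresholds, for illustration only.
clips = [
    Clip("clip_a.mp4", sync_conf=7.2, sync_dist=7.0),
    Clip("clip_b.mp4", sync_conf=2.1, sync_dist=11.5),
    Clip("clip_c.mp4", sync_conf=5.5, sync_dist=8.4),
]
strict = filter_by_lip_sync(clips, min_conf=6.0, max_dist=8.0)
loose = filter_by_lip_sync(clips, min_conf=4.0, max_dist=9.0)
```

Keeping the thresholds as parameters mirrors the flexibility described above: a stricter setting yields a smaller, cleaner subset, while a looser one keeps more data.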

3.4 Progressive Emotion-aware Training

Efficient use of our proposed dataset is critical for training high-performing emotional talking-head models. We demonstrate that training strategies are pivotal and introduce two key strategies. First, we propose an emotion-aware sampling strategy (Sec. 3.4.1), which enhances emotion modeling by learning the transformation from neutral to expressive facial representations. Second, we design a progressive functional training strategy (Sec. 3.4.2), a coarse-to-fine scheme that hierarchically refines overall motion, emotional expression, and lip synchronization. Concretely, our method initially trains the spatial layers to capture image-level expression information, enabling emotion-conditioned image-to-image generation. Then, the model is trained for temporal modeling. Building upon the learned emotional image generation, we implement a phased data sampling strategy to achieve audio-driven emotional video synthesis.

3.4.1 Emotion-aware Sampling Training Strategy

In the first training stage for emotional image-to-image generation, we employ an emotion-aware sampling strategy to enable effective learning of the distinctive characteristics of diverse emotional expressions. Specifically, when training on a specific emotion, we avoid sampling both reference and target images from the same emotional video sequence. Instead, the target image is randomly sampled from the corresponding emotional video, while the reference image is randomly selected from the neutral-expression video of the same identity, as shown in Fig. 3. This approach strengthens the model’s ability to discern the differences between various expressions and neutral expressions, thereby improving its capacity to capture expression-specific features.
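The difference between emotion-aware sampling and ordinary intra-video sampling comes down to where the reference frame is drawn from. The sketch below assumes a hypothetical index `videos` that maps an (identity, emotion) pair to a list of frame identifiers; the index layout, the frame names, and the function name are illustrative only.

```python
import random

def sample_training_pair(videos, identity, emotion, rng, emotion_aware=True):
    """Return (reference_frame, target_frame) for one training step.

    With emotion_aware=True the reference comes from the neutral video of
    the same identity; otherwise both frames come from the emotional video.
    """
    target = rng.choice(videos[(identity, emotion)])
    ref_key = (identity, "neutral") if emotion_aware else (identity, emotion)
    reference = rng.choice(videos[ref_key])
    return reference, target

# Hypothetical frame index for a single identity.
videos = {
    ("id_01", "neutral"): ["n_000", "n_001", "n_002"],
    ("id_01", "happy"): ["h_000", "h_001", "h_002"],
}
rng = random.Random(0)
ref, tgt = sample_training_pair(videos, "id_01", "happy", rng)
ref_intra, tgt_intra = sample_training_pair(
    videos, "id_01", "happy", rng, emotion_aware=False)
```

With a neutral reference, the expression in the target frame cannot be copied from the reference and has to be explained by the emotion condition.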

3.4.2 Progressive Functional Training Strategy

As illustrated in Fig. 3, we introduce a progressive functional training strategy implemented in three phases:

Phase 1 (Generalization Enhancement): First, we train the model on a mixed dataset including the in-the-wild emotional talking videos spanning diverse identities. This phase enhances the model’s generalization capability across diverse data sources.

Phase 2 (Emotion Refinement): To refine facial expression accuracy and lip-sync, we exclude in-the-wild videos and train solely on a hybrid dataset comprising lab-collected emotional MEAD videos and high-quality lip-sync HDTF videos. This combination of two high-precision datasets ensures robustly generated results, even with limited identity.

Phase 3 (Lip-Sync Specialization): Finally, to maximize lip-sync accuracy, we address potential interference from emotion by introducing an additional training phase. Specifically, we train the model on HDTF, a high-quality talking-head dataset featuring neutral facial expressions and precise lip synchronization.

With this progressive functional training strategy, our model generates natural, emotionally expressive talking portraits with precise audio-visual synchronization.
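The three phases can be summarized as an ordered schedule of dataset mixtures. The sketch below is a schematic under assumptions: the exact composition of the Phase 1 mixture is not spelled out in the text, so listing all four source datasets there is a guess, and `train_fn` is a hypothetical placeholder for a real training loop.

```python
# Ordered schedule; each phase continues from the previous phase's weights.
PHASES = [
    # Phase 1 mixture is an ASSUMPTION (text only says "mixed dataset
    # including the in-the-wild emotional talking videos").
    {"name": "generalization_enhancement",
     "datasets": ["CelebV-HQ", "Hallo3", "MEAD", "HDTF"]},
    # Phase 2: in-the-wild videos excluded, lab-quality data only.
    {"name": "emotion_refinement", "datasets": ["MEAD", "HDTF"]},
    # Phase 3: neutral, precisely lip-synced data only.
    {"name": "lip_sync_specialization", "datasets": ["HDTF"]},
]

def run_schedule(phases, train_fn, state=None):
    """Call train_fn(state, datasets) once per phase, threading the state."""
    for phase in phases:
        state = train_fn(state, phase["datasets"])
    return state

def record_phase(state, datasets):
    """Placeholder 'training' that only logs which data each phase saw."""
    return (state or []) + [tuple(datasets)]

history = run_schedule(PHASES, record_phase)
```

The data narrows from broad and noisy to small and precise, which matches the coarse-to-fine description: generalization first, then expression accuracy, then lip synchronization.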

4 Experiments and Results

Dataset and Implementation Details. Our method is trained on an NVIDIA H800 GPU, using a batch size of 4 with 512 × 512 pixel videos. For evaluation, following EAT [8] and EDTalk [31], we select four test subjects from MEAD [34] and sample 256 emotional talking head videos, covering all 8 emotions. To further assess generalization performance, we construct an additional in-the-wild out-of-domain test set comprising 7 reference images and 40 audio samples, resulting in 280 synthesized videos spanning 8 distinct emotional categories.

Evaluation Metrics. To evaluate the generated emotional talking portrait videos, we employ several metrics. First, emotional accuracy of videos is assessed using the pre-trained emotion classifier [21], as referenced in EAT [8]. Second, audio-visual synchronization is measured using the lip-sync metrics (LSE-D and LSE-C) from SyncNet [4], as in Wav2Lip [23]. Finally, image quality of the synthesized portraits is evaluated using the Fréchet Inception Distance (FID) [12].
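As a small worked example of the emotion accuracy metric, the sketch below computes the fraction of videos whose predicted emotion matches the target emotion. The label names are placeholders and the balanced 256-video layout is an assumption. With eight classes, a generator that always produces the same expression would score 1/8 = 12.5% on a balanced set, which is consistent with the 12.50% entries reported for emotion-agnostic baselines in Table 2.

```python
def emotion_accuracy(predicted, target):
    """Fraction of samples whose predicted emotion equals the target."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, target)) / len(target)

# Placeholder labels for 8 emotion classes, 32 videos each (256 in total).
EMOTIONS = [f"emotion_{i}" for i in range(8)]
targets = [e for e in EMOTIONS for _ in range(32)]

# A generator that always shows one expression is right for one class only.
constant_predictions = ["emotion_0"] * len(targets)
chance_level = emotion_accuracy(constant_predictions, targets)

# A perfect generator matches every target.
perfect = emotion_accuracy(targets, targets)
```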

Baselines. We perform a comparative analysis with state-of-the-art methods, including representative emotion-agnostic talking head approaches (MakeItTalk [46], SadTalker [43], Aniportrait [37], Echomimic [2], Hallo [41], Hallo2 [6]) as well as open-source emotion-controllable talking head approaches (EAMM [16], PD-FGC [33], EAT [8], and EDTalk [31]). For text-controlled methods TalkCLIP [19] and InstructAvatar [36], their source codes are not publicly available, making quantitative comparisons infeasible. Accordingly, we extract reference images and driving audio from InstructAvatar’s official demo videos and use our method to generate talking-head videos for visual comparison.

4.1 Comparison with Other Methods

We perform quantitative comparisons with other methods on the MEAD test set [34] and out-of-domain in-the-wild test set. Table 2 shows that our method outperforms competing approaches in emotional accuracy and visual quality, highlighting the effectiveness of our EmoCAST in achieving precise and vivid emotional representations. For audio-visual synchronization, our method performs comparably to existing techniques on the MEAD test set, while demonstrating superior performance on the in-the-wild test set, indicating stronger generalization.

We further conduct visual comparisons with other state-of-the-art methods. As illustrated in Fig. 4, GAN-based methods exhibit lower visual fidelity and emotional expressiveness, leading to perceptibly unnatural emotional talking videos. Although EAMM [16], PD-FGC [33], and EDTalk [31] utilize emotional videos as affective sources, their synthesized facial expressions remain insufficiently pronounced. EAT [8] controls expression generation via emotional labels, enabling it to produce accurate expressions. However, the visual quality of these expressions is suboptimal, and the mouth sometimes fails to close in alignment with the ground truth. Moreover, as shown in Fig. 1, under the same text-control setting, InstructAvatar [36] yields weaker, less natural expressions and exhibits poor identity preservation. In contrast, our approach achieves more vivid and faithful facial emotional details, maintains lip synchronization with the ground-truth lip movements, and robustly preserves identity.

4.2 User Study

To further evaluate the quality of the generated emotional talking portrait videos, we conduct a user study involving 22 participants. The study assesses the videos across three dimensions: emotion quality, audio-visual synchronization, and video quality, with scores ranging from 1 (minimum) to 5 (maximum). We compare 5 baseline methods with our proposed approach by sampling 10 videos per method from the in-the-wild test set, obtaining a total of 60 videos covering 8 emotions. The results of the user study are presented in Table 3.

4.3 Ablation Studies

We conduct comprehensive ablation studies to demonstrate the effectiveness of each design component.

Text-guided Emotive Attention Module. We perform ablation studies to assess the text-guided emotive attention module’s capacity to learn appearance-level emotional cues. We compare integrating emotional text and facial features through a shared cross-attention block with our decoupled emotive module. The results are presented in Table 4 and Fig. 5. Relative to the shared cross-attention baseline, our text-guided decoupled module learns more precise expressions while better preserving identity.

Emotive Audio Attention Module. To evaluate the effectiveness of the interaction between speech and textual emotion, we conduct an ablation study by removing the interaction between textual emotion features and audio features in the emotive audio attention module. As illustrated in Fig. 5 and Table 4, enabling this interaction substantially improves performance, yielding more consistent facial motions that better synchronize with both the speech content and controlled expressions.

Emotion-aware Sampling Training Strategy. To validate the efficacy of our emotion-aware sampling training strategy, we compare it with the original intra-video sampling mechanism, in which both the reference image and the target image are selected from the same video. As shown in Fig. 5 and Table 4, the emotion-aware sampling strategy learns more vivid and accurate expressions.
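To make the contrast concrete, the sketch below shows the intra-video baseline as described here, next to one hypothetical alternative. This section does not restate how the emotion-aware strategy selects its pairs, so `sample_cross_emotion` is purely an assumed variant (a reference frame drawn from another clip of the same identity with a different emotion label) and should not be read as the paper's procedure. The `Clip` structure and all names are illustrative.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Illustrative container for one training video (hypothetical structure)."""
    identity: str
    emotion: str
    frames: tuple  # frame identifiers


def sample_intra_video(clip, rng):
    """Baseline: reference and target frames come from the same video."""
    reference, target = rng.sample(clip.frames, 2)
    return reference, target


def sample_cross_emotion(clips, target_clip, rng):
    """ASSUMED variant, not confirmed by the text: take the reference frame
    from a clip of the same identity whose emotion label differs."""
    candidates = [c for c in clips
                  if c.identity == target_clip.identity
                  and c.emotion != target_clip.emotion]
    if not candidates:
        # Fall back to the intra-video baseline when no such clip exists.
        return sample_intra_video(target_clip, rng)
    reference_clip = rng.choice(candidates)
    return rng.choice(reference_clip.frames), rng.choice(target_clip.frames)


rng = random.Random(0)
clips = [
    Clip("id_a", "happy", ("a_happy_0", "a_happy_1", "a_happy_2")),
    Clip("id_a", "neutral", ("a_neutral_0", "a_neutral_1")),
    Clip("id_b", "sad", ("b_sad_0", "b_sad_1")),
]
intra_pair = sample_intra_video(clips[0], rng)
cross_pair = sample_cross_emotion(clips, clips[0], rng)
fallback_pair = sample_cross_emotion(clips, clips[2], rng)
```

With intra-video pairs, the reference and target usually share the same expression, so the reference image can leak emotional appearance; the assumed variant illustrates one way a sampler could break that shortcut.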

Progressive Functional Training Strategy. We conduct ablation studies to assess the progressive functional training strategy, which trains the model in a coarse-to-fine manner to generate natural, emotionally expressive talking portraits with precise audio-visual synchronization. For comparison, we evaluate a single-stage training baseline that uses all data simultaneously. As shown in Table 4 and Fig. 5, the progressive training strategy produces more accurate facial expressions and significantly improves lip synchronization.

5 Conclusion

We propose EmoCAST, a novel diffusion-based framework for generating customized, emotionally expressive talking head videos with flexible natural-language emotional control. The text prompts are integrated into the network via a text-guided emotive attention module and an emotive audio attention module, which account for the relationships between emotion, appearance, and motion. Furthermore, to address the scarcity of emotional datasets, we construct an Emotive Text-to-Talking Head (ETTH) dataset containing precise expression labels and rich emotive textual descriptions. Based on this dataset, we propose an emotion-aware sampling strategy and a progressive functional training strategy, which further improve our model’s expression quality and lip-sync accuracy. Extensive experiments demonstrate that EmoCAST achieves state-of-the-art performance in generating highly natural and customizable expressive talking head videos.

Bibliography

[1] Aras Bozkurt, Xiao Junhong, Sarah Lambert, Angelica Pazurek, Helen Crompton, Suzan Koseoglu, Robert Farrow, Melissa Bond, Chrissi Nerantzi, Sarah Honeychurch, et al. Speculative futures on chatgpt and generative artificial intelligence (ai): A collective reflection from the educational landscape. Asian Journal of Distance Education, 18(1):53–130, 2023.
[2] Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2403–2410, 2025.
[3] Kun Cheng, Xiaodong Cun, Yong Zhang, Menghan Xia, Fei Yin, Mingrui Zhu, Xuan Wang, Jue Wang, and Nannan Wang. Videoretalking: Audio-based lip synchronization for talking head video editing in the wild. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
[4] Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13, pages 251–263. Springer, 2017.
[5] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018.
[6] Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, and Jingdong Wang. Hallo2: Long-duration and high-resolution audio-driven portrait image animation. arXiv preprint arXiv:2410.07718, 2024a.
[7] Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, and Siyu Zhu. Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer. arXiv preprint arXiv:2412.00733, 2024b.
[8] Yuan Gan, Zongxin Yang, Xihang Yue, Lingyun Sun, and Yi Yang. Efficient emotional adaptation for audio-driven talking-head generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22634–22645, 2023.