Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation

Lingsi Zhu; Yuefeng Zou; Yunxiang Zhang; Naixiang Zheng; Guoyuan Wang; Jun Yu; Jiaen Liang; Wei Huang; Shengping Liu; Ximin Zheng

arXiv:2603.14976·cs.MM·March 17, 2026

Open Access

TL;DR

This paper introduces TAEMI, a multimodal framework that uses textual transcripts as stable anchors to improve emotional mimicry intensity estimation in noisy, real-world environments, achieving state-of-the-art results.

Contribution

The paper proposes a novel Text-Anchored Dual Cross-Attention mechanism and strategies for handling missing data, enhancing robustness and accuracy in multimodal emotion estimation.

Findings

01 TAEMI outperforms baseline methods on the Hume-Vidmimic2 dataset.

02 The framework maintains high performance under noisy and incomplete data conditions.

03 It effectively captures fine-grained emotional variations.

Abstract

Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcripts, which inherently encode a stable, time-independent semantic prior, as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust…
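The abstract is truncated here, so the exact design of the Text-Anchored Dual Cross-Attention mechanism is not available on this page. As a rough illustration of the general idea only, the sketch below assumes that text token embeddings act as attention queries over the visual and acoustic sequences, and that a missing modality is replaced by zeros. The function names (`cross_attention`, `text_anchored_fusion`), the zero-fill strategy, the mean pooling, and the absence of learned projections are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention where `queries` attend over `keys_values`.

    queries:     (T_q, d) array, here the text token embeddings.
    keys_values: (T_k, d) array, here one noisy modality (visual or audio).
    Returns a (T_q, d) array: one attended summary per query token.
    No learned projections are used; this is a hypothetical simplification.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (T_q, T_k)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ keys_values                   # (T_q, d)


def text_anchored_fusion(text, visual=None, audio=None):
    """Fuse modalities by letting text tokens query each available modality.

    A modality passed as None is treated as missing and contributes zeros
    (an assumed fallback; the paper's missing-data strategy may differ).
    Returns a pooled feature vector of shape (3 * d,).
    """
    parts = [text]
    for modality in (visual, audio):
        if modality is None:
            parts.append(np.zeros_like(text))
        else:
            parts.append(cross_attention(text, modality))
    fused = np.concatenate(parts, axis=-1)  # (T_text, 3 * d)
    return fused.mean(axis=0)               # average over text tokens


# Toy usage with random features: 5 text tokens, 20 video frames, 30 audio frames.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 8))
visual_feats = rng.normal(size=(20, 8))
audio_feats = rng.normal(size=(30, 8))

full = text_anchored_fusion(text_feats, visual_feats, audio_feats)
no_video = text_anchored_fusion(text_feats, None, audio_feats)
```

In this sketch the output length depends only on the text sequence, so dropping or corrupting the visual or acoustic stream changes the attended content but never the shape of the fused representation. That is one plausible reading of what "anchoring" on text buys in an asymmetric fusion scheme.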

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · EEG and Brain-Computer Interfaces