MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition

Hyo Jin Jon; Longbin Jin; Hyuntaek Jung; Hyunseo Kim; Donghun Min; Eun Yi Kim

arXiv:2506.19887 · eess.AS · October 15, 2025

TL;DR

This paper introduces MATER, a hierarchical framework that combines acoustic and textual features at the word, utterance, and embedding levels for more interpretable speech emotion recognition in naturalistic conditions, where speaker variability and ambiguous emotional expressions make the task difficult.

Contribution

The paper proposes a multi-level hierarchical approach that integrates acoustic and textual cues, together with an uncertainty-aware ensemble strategy that mitigates annotator inconsistencies and improves robustness on ambiguous emotional expressions.

Findings

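01. MATER achieved a Macro-F1 of 41.01% in categorical emotion recognition, ranking fourth in the task.

02. It secured second place in valence prediction with a CCC of 0.6941.

03. The challenge results indicate that the framework copes with the intra- and inter-subject variability of natural speech.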

Abstract

This paper presents our contributions to the Speech Emotion Recognition in Naturalistic Conditions (SERNC) Challenge, where we address categorical emotion recognition and emotional attribute prediction. To handle the complexities of natural speech, including intra- and inter-subject variability, we propose Multi-level Acoustic-Textual Emotion Representation (MATER), a novel hierarchical framework that integrates acoustic and textual features at the word, utterance, and embedding levels. By fusing low-level lexical and acoustic cues with high-level contextualized representations, MATER effectively captures both fine-grained prosodic variations and semantic nuances. Additionally, we introduce an uncertainty-aware ensemble strategy to mitigate annotator inconsistencies, improving robustness in ambiguous emotional expressions. MATER ranks fourth in both tasks with a Macro-F1 of 41.01% and…
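The two scores quoted above, Macro-F1 for categorical emotion recognition and the concordance correlation coefficient (CCC) for attribute prediction, can be illustrated with a minimal sketch. This uses the standard textbook definitions of both metrics and is not the challenge's official scoring code; the function names `macro_f1` and `ccc`, the toy labels, and the zero-denominator convention are assumptions made for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no
        # predictions scores 0.0 here (an assumed convention).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def ccc(x, y):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    # Two identical constant sequences give a zero denominator;
    # returning 0.0 in that case is an assumed convention.
    return 2 * cov / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels and ratings, invented for this example.
    true_labels = ["happy", "sad", "sad", "angry"]
    pred_labels = ["happy", "sad", "happy", "angry"]
    print(f"Macro-F1: {macro_f1(true_labels, pred_labels):.4f}")

    true_valence = [1.0, 2.0, 3.0]
    pred_valence = [2.0, 3.0, 4.0]
    print(f"CCC: {ccc(true_valence, pred_valence):.4f}")
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and targets: in the toy valence example the predictions are perfectly correlated with the targets but shifted by one, so the score falls below 1.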
