MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition
Hyo Jin Jon, Longbin Jin, Hyuntaek Jung, Hyunseo Kim, Donghun Min, Eun Yi Kim

TL;DR
MATER is a hierarchical framework that fuses acoustic and textual features at the word, utterance, and embedding levels for more interpretable speech emotion recognition in naturalistic conditions. It targets intra- and inter-subject variability and ambiguous emotional expressions.
Contribution
The paper proposes a multi-level hierarchical approach that integrates acoustic and textual cues, together with an uncertainty-aware ensemble strategy that mitigates annotator inconsistencies and improves robustness on ambiguous emotional expressions.
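This summary does not describe how the uncertainty-aware ensemble is computed. As a rough illustration of the general idea only, the sketch below weights each model's class probabilities by its confidence, measured as one minus normalized entropy, so that near-uniform (uncertain) predictions contribute less. The function names and the entropy-based weighting are assumptions made for this example, not the authors' method.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_weighted_ensemble(model_probs):
    """Fuse per-model class probabilities for a single utterance.

    model_probs: list of probability vectors, one per model, same length.
    Hypothetical weighting: weight = 1 - H(p) / log(n_classes).
    """
    n_classes = len(model_probs[0])
    if n_classes < 2:
        raise ValueError("need at least two classes")
    max_h = math.log(n_classes)
    # Confidence in [0, 1]; clamp to guard against rounding below zero.
    weights = [max(0.0, 1.0 - entropy(p) / max_h) for p in model_probs]
    total = sum(weights)
    if total <= 1e-12:
        # Every model is maximally uncertain: fall back to a plain average.
        weights = [1.0] * len(model_probs)
        total = float(len(model_probs))
    return [
        sum(w * p[k] for w, p in zip(weights, model_probs)) / total
        for k in range(n_classes)
    ]


if __name__ == "__main__":
    confident = [0.9, 0.1]
    unsure = [0.5, 0.5]
    print(uncertainty_weighted_ensemble([confident, unsure]))
```

In this toy call the uniform prediction receives zero weight, so the fused output follows the confident model. A real system would more likely estimate uncertainty from annotator disagreement or validation behaviour than from a single softmax output.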
Findings
MATER ranked fourth in categorical emotion recognition with a Macro-F1 of 41.01%.
It placed second in valence prediction with a concordance correlation coefficient (CCC) of 0.6941.
The results suggest that the framework copes with the variability of natural speech.
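For readers unfamiliar with the two metrics reported here, the following is a minimal sketch of their standard definitions using only the Python standard library. Macro-F1 averages per-class F1 scores with equal weight, and CCC is 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function names are chosen for this example, and the challenge's official scoring script may differ in details such as label handling.

```python
from statistics import fmean


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return fmean(scores)


def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient with population (biased) variances."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mu_t, mu_p = fmean(y_true), fmean(y_pred)
    var_t = fmean([(t - mu_t) ** 2 for t in y_true])
    var_p = fmean([(p - mu_p) ** 2 for p in y_pred])
    cov = fmean([(t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred)])
    denom = var_t + var_p + (mu_t - mu_p) ** 2
    # CCC is undefined when both inputs are the same constant; return 0.0 here.
    return 2 * cov / denom if denom else 0.0


if __name__ == "__main__":
    print(macro_f1(["a", "a", "b"], ["a", "b", "b"]))
    print(concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

Unlike Pearson correlation, CCC penalizes a constant offset: predictions that track the targets perfectly but are shifted by one unit, as in the second call, score below 1.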
Abstract
This paper presents our contributions to the Speech Emotion Recognition in Naturalistic Conditions (SERNC) Challenge, where we address categorical emotion recognition and emotional attribute prediction. To handle the complexities of natural speech, including intra- and inter-subject variability, we propose Multi-level Acoustic-Textual Emotion Representation (MATER), a novel hierarchical framework that integrates acoustic and textual features at the word, utterance, and embedding levels. By fusing low-level lexical and acoustic cues with high-level contextualized representations, MATER effectively captures both fine-grained prosodic variations and semantic nuances. Additionally, we introduce an uncertainty-aware ensemble strategy to mitigate annotator inconsistencies, improving robustness in ambiguous emotional expressions. MATER ranks fourth in both tasks with a Macro-F1 of 41.01% and…
