Crab: Multi Layer Contrastive Supervision to Improve Speech Emotion Recognition Under Both Acted and Natural Speech Condition

Lucas H. Ueda; João G. T. Lima; Paula D. P. Costa

arXiv:2603.23673 · eess.AS · March 26, 2026

TL;DR

Crab introduces a multi-layer contrastive supervision strategy within a bimodal transformer architecture to enhance speech emotion recognition, especially in natural and imbalanced data scenarios, outperforming existing methods.

Contribution

The paper proposes a novel Multi Layer Contrastive Supervision approach integrated into a bimodal transformer for improved SER performance.

Findings

01 Consistently outperforms baseline models across datasets.

02 Significant gains in naturalistic and imbalanced conditions.

03 Effective multi-layer contrastive learning without extra inference parameters.

Abstract

Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to severe class imbalance and the prevalence of spontaneous, natural speech. While recent approaches leverage self-supervised learning (SSL) representations and multimodal fusion of speech and text, most existing methods apply supervision only at the final classification layer, limiting the discriminative power of intermediate representations. In this work, we propose Crab (Contrastive Representation and Multimodal Aligned Bottleneck), a bimodal Cross-Modal Transformer architecture that integrates speech representations from WavLM and textual representations from RoBERTa, together with a novel Multi Layer Contrastive Supervision (MLCS) strategy. MLCS injects multi-positive contrastive learning signals at multiple layers of the network, encouraging emotionally discriminative representations…
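
To make the idea of multi-positive contrastive signals at several layers concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a generic supervised-contrastive formulation in which every other sample in the batch with the same emotion label counts as a positive, and it sums that loss over per-layer embeddings. The function names, the `temperature` value, and the per-layer `weights` are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def multi_positive_contrastive_loss(embeddings, labels, temperature=0.1):
    """Assumed SupCon-style loss: all same-label samples in the batch are positives."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Average negative log-probability over this anchor's positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


def multi_layer_contrastive_loss(layer_embeddings, labels, weights=None,
                                 temperature=0.1):
    """Weighted sum of the contrastive loss over several layers (hypothetical weighting)."""
    weights = weights or [1.0] * len(layer_embeddings)
    return sum(
        w * multi_positive_contrastive_loss(emb, labels, temperature)
        for w, emb in zip(weights, layer_embeddings)
    )


# Toy batch: two emotion classes, 2-D embeddings from two hypothetical layers.
labels = [0, 0, 1, 1]
layer_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
layer_b = [[1.0, 0.2], [0.8, 0.1], [0.2, 1.0], [0.1, 0.7]]
print(multi_layer_contrastive_loss([layer_a, layer_b], labels))
```

In a training setup of this kind, the contrastive terms would be added to the usual classification loss and used only during training, which is consistent with the listing's note that the method adds no extra inference parameters.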

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Voice and Speech Disorders