Crab: Multi Layer Contrastive Supervision to Improve Speech Emotion Recognition Under Both Acted and Natural Speech Condition
Lucas H. Ueda, João G. T. Lima, Paula D. P. Costa

TL;DR
Crab introduces a multi-layer contrastive supervision strategy within a bimodal transformer architecture to enhance speech emotion recognition, especially in natural and imbalanced data scenarios, outperforming existing methods.
Contribution
The paper proposes a novel Multi Layer Contrastive Supervision approach integrated into a bimodal transformer for improved SER performance.
Findings
Consistently outperforms baseline models across datasets.
Significant gains in naturalistic and imbalanced conditions.
Effective multi-layer contrastive learning without extra inference parameters.
Abstract
Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to severe class imbalance and the prevalence of spontaneous, natural speech. While recent approaches leverage self-supervised learning (SSL) representations and multimodal fusion of speech and text, most existing methods apply supervision only at the final classification layer, limiting the discriminative power of intermediate representations. In this work, we propose Crab (Contrastive Representation and Multimodal Aligned Bottleneck), a bimodal Cross-Modal Transformer architecture that integrates speech representations from WavLM and textual representations from RoBERTa, together with a novel Multi Layer Contrastive Supervision (MLCS) strategy. MLCS injects multi-positive contrastive learning signals at multiple layers of the network, encouraging emotionally discriminative representations…
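To make the idea of multi-positive contrastive supervision at several layers more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact loss, so a SupCon-style supervised contrastive loss is used as a stand-in, and the function names, the temperature value, the uniform layer weights, and the assumption that each layer yields one pooled vector per utterance are all hypothetical.

```python
import numpy as np


def multi_positive_contrastive_loss(features, labels, temperature=0.1):
    """SupCon-style stand-in: every same-label sample in the batch is a positive."""
    z = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)
    if n < 2:
        return 0.0  # no pairs to contrast

    # L2-normalise so dot products become cosine similarities.
    z = z / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-12)
    logits = z @ z.T / temperature

    # Exclude each anchor's similarity with itself from the softmax.
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over the other samples in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: same emotion label, different sample.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # anchors with no positive in the batch are skipped
    if not valid.any():
        return 0.0

    # Average the log-probability over all positives of each anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(per_anchor.mean())


def multi_layer_loss(layer_features, labels, layer_weights=None, temperature=0.1):
    """Weighted sum of the contrastive loss over several layers (uniform weights assumed)."""
    if layer_weights is None:
        layer_weights = [1.0] * len(layer_features)
    return sum(
        w * multi_positive_contrastive_loss(f, labels, temperature)
        for w, f in zip(layer_weights, layer_features)
    )


# Usage: three hypothetical layers, batch of 8 utterances, 16-dim pooled features.
rng = np.random.default_rng(0)
emotion_labels = np.array([0, 0, 1, 1, 2, 2, 0, 1])
pooled_per_layer = [rng.normal(size=(8, 16)) for _ in range(3)]
auxiliary_loss = multi_layer_loss(pooled_per_layer, emotion_labels)
```

In training, a term like `auxiliary_loss` would typically be added to the usual classification loss. The sketch operates directly on the features and introduces no learnable parameters, which fits the stated finding that the contrastive supervision adds no inference-time parameters, though how the paper achieves this is not specified in the abstract.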
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Voice and Speech Disorders
