Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition
Girish, Mohd Mujtaba Akhtar, Muskaan Singh

TL;DR
This paper introduces NOVA ARC, a geometry-aware framework that improves multilingual speech emotion recognition, especially in low-resource settings, by transferring supervision from labeled non-verbal vocalizations to unlabeled verbal speech.
Contribution
The paper proposes a non-verbal-to-verbal transfer paradigm and a geometry-aware model for multilingual speech emotion recognition (SER). The model is reported to outperform its Euclidean counterparts and self-supervised learning (SSL) baselines.
Findings
NOVA ARC achieves the strongest performance in non-verbal-to-verbal adaptation.
It outperforms Euclidean counterparts and strong SSL baselines.
The authors present it as the first work to introduce non-verbal-to-verbal transfer for SER.
Abstract
In this work, we introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal speech and suffer from poor cross-lingual transfer, our approach reformulates LRM-SER as non-verbal-to-verbal transfer, where supervision from a labeled non-verbal source domain is adapted to unlabeled verbal speech across multiple target languages. To this end, we propose NOVA ARC, a geometry-aware framework that models affective structure in the Poincaré ball, discretizes paralinguistic patterns via a hyperbolic vector-quantized prosody codebook, and captures emotion intensity through a hyperbolic emotion lens. For unsupervised adaptation, NOVA-ARC performs optimal-transport-based prototype alignment…
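To make the geometric vocabulary in the abstract concrete, here is a minimal sketch of two generic building blocks it mentions: the geodesic distance in the Poincaré ball and nearest-codeword lookup under that distance. It is not the authors' implementation. The function names `poincare_distance` and `quantize`, the unit curvature, the 2-D toy embeddings, and the hand-picked codebook are all assumptions made for illustration; the abstract does not specify how NOVA ARC learns or updates its codebook.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance in the unit Poincare ball (curvature -1 assumed).

    d(x, y) = arccosh(1 + 2 * |x - y|^2 / ((1 - |x|^2) * (1 - |y|^2)))
    Inputs broadcast over leading axes; the last axis is the embedding dimension.
    """
    sq_diff = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - np.sum(x * x, axis=-1)) * (1.0 - np.sum(y * y, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq_diff / np.maximum(denom, eps))

def quantize(z, codebook):
    """Assign each row of z to its nearest codeword under hyperbolic distance.

    Hypothetical helper: shows only the lookup step of vector quantization,
    not how a codebook would be trained.
    """
    # Pairwise distances, shape (num_points, num_codewords).
    d = poincare_distance(z[:, None, :], codebook[None, :, :])
    return np.argmin(d, axis=1)

# Toy data (made up): three codewords and three embeddings inside the unit disk.
codebook = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, -0.5]])
z = np.array([[0.45, 0.05], [0.0, -0.4], [0.05, 0.0]])
codes = quantize(z, codebook)
print(codes)
```

Distances in this geometry grow rapidly near the boundary of the ball, which is why hyperbolic embeddings are commonly used to represent hierarchical or intensity-like structure.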
