MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages
Hardik B. Sailor, Aw Ai Ti, Chen Fang Yih Nancy, Chiu Ying Lay, Ding Yang, He Yingxu, Jiang Ridong, Li Jingtao, Liao Jingyi, Liu Zhuohan, Lu Yanfeng, Ma Yi, Manas Gupta, Muhammad Huzaifah Bin Md Shahrin, Nabilah Binte Md Johan, Nattadaporn Lertcheva, Pan Chunlei, Pham Minh Duc

TL;DR
MERaLiON-SER is a robust speech emotion recognition model that effectively captures both categorical and dimensional emotions across English and Southeast Asian languages, outperforming existing models and enhancing empathetic audio systems.
Contribution
The paper introduces MERaLiON-SER, a novel multilingual speech emotion recognition model using hybrid loss functions for joint discrete and dimensional emotion modeling, with superior cross-lingual performance.
Findings
Outperforms open-source speech encoders and Audio-LLMs in multilingual settings.
Effectively captures both emotion categories and intensity, valence, dominance.
Demonstrates robustness across English and Southeast Asian languages.
Abstract
We present MERaLiON-SER, a robust speech emotion recognition model designed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modelling. This dual approach enables the model to capture both the distinct categories of emotion (like happy or angry) and the fine-grained, such as arousal (intensity), valence (positivity/negativity), and dominance (sense of control), leading to a more comprehensive and robust representation of human affect. Extensive evaluations across multilingual Singaporean languages (English, Chinese, Malay, and Tamil ) and other public benchmarks show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs. These results underscore the importance of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsEmotion and Mood Recognition · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining
