MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages

Hardik B. Sailor; Aw Ai Ti; Chen Fang Yih Nancy; Chiu Ying Lay; Ding Yang; He Yingxu; Jiang Ridong; Li Jingtao; Liao Jingyi; Liu Zhuohan; Lu Yanfeng; Ma Yi; Manas Gupta; Muhammad Huzaifah Bin Md Shahrin; Nabilah Binte Md Johan; Nattadaporn Lertcheva; Pan Chunlei; Pham Minh Duc; Siti Maryam Binte Ahmad Subaidi; Siti Umairah Binte Mohammad Salleh; Sun Shuo; Tarun Kumar Vangani; Wang Qiongqiong; Won Cheng Yi Lewis; Wong Heng Meng Jeremy; Wu Jinyang; Zhang Huayun; Zhang Longyin; Zou Xunlong

arXiv:2511.04914·cs.SD·November 13, 2025

TL;DR

MERaLiON-SER is a robust speech emotion recognition model that captures both categorical and dimensional emotions across English and Southeast Asian languages, outperforming open-source speech encoders and large Audio-LLMs and enhancing empathetic audio systems.

Contribution

The paper introduces MERaLiON-SER, a novel multilingual speech emotion recognition model trained with a hybrid loss (weighted categorical cross-entropy plus Concordance Correlation Coefficient) for joint discrete and dimensional emotion modeling, with superior cross-lingual performance.

Findings

01 Outperforms open-source speech encoders and Audio-LLMs in multilingual settings.

02 Captures both emotion categories and the dimensional attributes arousal (intensity), valence, and dominance.

03 Demonstrates robustness across English and Southeast Asian languages.

Abstract

We present MERaLiON-SER, a robust speech emotion recognition model designed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modelling. This dual approach enables the model to capture both the distinct categories of emotion (like happy or angry) and the fine-grained dimensions, such as arousal (intensity), valence (positivity/negativity), and dominance (sense of control), leading to a more comprehensive and robust representation of human affect. Extensive evaluations across multilingual Singaporean languages (English, Chinese, Malay, and Tamil) and other public benchmarks show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs. These results underscore the importance of…
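
The hybrid objective described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the normalisation of the weighted cross-entropy by the sum of class weights, the averaging of the CCC loss over the three dimensions, and the mixing weights `alpha` and `beta` are all assumptions made for illustration. The abstract does not state how the two terms are balanced.

```python
import math

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy over a batch.

    Assumption: the batch loss is normalised by the sum of the weights
    of the target classes (one common convention, not confirmed by the paper).
    """
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        w = class_weights[label]
        total += -w * (row[label] - log_z)
        weight_sum += w
    return total / weight_sum

def ccc(pred, target):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    vp = sum((p - mp) ** 2 for p in pred) / n
    vt = sum((t - mt) ** 2 for t in target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target)) / n
    denom = vp + vt + (mp - mt) ** 2
    # Assumption: return 0.0 when both inputs are constant and identical in mean.
    return 2.0 * cov / denom if denom > 0 else 0.0

def hybrid_loss(logits, labels, class_weights, dim_preds, dim_targets,
                alpha=1.0, beta=1.0):
    """alpha * weighted CE + beta * mean(1 - CCC) over emotion dimensions.

    dim_preds / dim_targets hold one sequence per dimension
    (e.g. arousal, valence, dominance), each spanning the batch.
    alpha and beta are hypothetical mixing weights.
    """
    ce = weighted_cross_entropy(logits, labels, class_weights)
    ccc_loss = sum(1.0 - ccc(p, t)
                   for p, t in zip(dim_preds, dim_targets)) / len(dim_preds)
    return alpha * ce + beta * ccc_loss
```

The CCC term rewards predictions that match the targets in both correlation and scale: it reaches 1 only when predictions equal the targets, so `1 - CCC` is zero for a perfect dimensional prediction and the hybrid loss then reduces to the classification term.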

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining