ERM-MinMaxGAP: Benchmarking and Mitigating Gender Bias in Multilingual Multimodal Speech-LLM Emotion Recognition

Zi Haur Pang; Xiaoxue Gao; Tatsuya Kawahara; Nancy F. Chen

arXiv:2603.21050 · cs.SD · March 24, 2026

Open Access

TL;DR

This paper introduces a multilingual, multimodal benchmark for speech emotion recognition (SER) that measures gender bias across languages and modalities, and proposes a fairness-aware training objective, ERM-MinMaxGAP, to mitigate that bias.

Contribution

The paper presents a novel benchmark for multilingual, multimodal speech emotion recognition (SER) and a fairness-aware training approach, ERM-MinMaxGAP, that reduces gender bias in SER systems.

Findings

01. Gender bias varies significantly across languages.

02. Multimodal fusion does not consistently improve fairness.

03. ERM-MinMaxGAP reduces the gender bias gap and improves performance.

Abstract

Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find bias is strongly language-dependent, and multimodal fusion does not reliably improve fairness. To address these issues, we propose ERM-MinMaxGAP, a fairness-informed training objective, which augments empirical risk minimization (ERM) with a proposed adaptive fairness weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Building upon the Qwen2-Audio backbone, our ERM-MinMaxGAP approach improves multilingual SER performance by 5.5% and 5.0% while reducing…
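The abstract describes the objective only in words, so the following is a minimal sketch of one possible reading, not the authors' implementation. It adds to the mean per-sample loss (the ERM term) a penalty equal to the largest absolute male-female mean-loss gap found across (language, modality) groups. The function names, the "male"/"female" label strings, and the fixed `fairness_weight` are assumptions for illustration; the adaptive fairness weight mechanism mentioned in the abstract is not specified there and is not modeled here.

```python
from collections import defaultdict
from statistics import mean


def group_gender_gaps(losses, languages, modalities, genders):
    """Return {(language, modality): |mean male loss - mean female loss|}.

    Assumes gender labels are the strings "male" and "female" (hypothetical
    encoding chosen for this sketch).
    """
    buckets = defaultdict(lambda: {"male": [], "female": []})
    for loss, lang, mod, gender in zip(losses, languages, modalities, genders):
        buckets[(lang, mod)][gender].append(loss)

    gaps = {}
    for key, by_gender in buckets.items():
        # A gap is only defined when both genders appear in the group.
        if by_gender["male"] and by_gender["female"]:
            gaps[key] = abs(mean(by_gender["male"]) - mean(by_gender["female"]))
    return gaps


def erm_with_max_gap(losses, languages, modalities, genders, fairness_weight=1.0):
    """ERM term plus a weighted penalty on the largest group-wise gender gap.

    fairness_weight is a fixed constant here (assumption); the paper's
    adaptive weighting is not reproduced.
    """
    erm = mean(losses)
    gaps = group_gender_gaps(losses, languages, modalities, genders)
    max_gap = max(gaps.values(), default=0.0)
    return erm + fairness_weight * max_gap


# Toy batch: English shows a loss gap between genders, Japanese does not.
losses = [1.0, 3.0, 2.0, 2.0]
languages = ["en", "en", "ja", "ja"]
modalities = ["audio", "audio", "audio", "audio"]
genders = ["male", "female", "male", "female"]

gaps = group_gender_gaps(losses, languages, modalities, genders)
total = erm_with_max_gap(losses, languages, modalities, genders, fairness_weight=0.5)
```

In this toy batch the penalty is driven entirely by the English audio group, which illustrates why a maximum-gap term focuses training pressure on the worst-off group rather than on the average gap. In an actual training loop the per-sample losses would be differentiable tensors rather than Python floats.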

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Voice and Speech Disorders