VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

Yi-Cheng Lin; Yusuke Hirota; Sung-Feng Huang; Hung-yi Lee

arXiv:2604.17248·eess.AS·April 21, 2026

TL;DR

VIBE introduces a new framework for evaluating biases in large audio-language models using real-world speech in open-ended tasks, revealing systematic social stereotypes.

Contribution

VIBE offers a bias evaluation method built on real human speech and open-ended tasks, addressing the limitations of benchmarks that rely on synthetic speech and multiple-choice questions.

Findings

01

Gender cues often trigger larger distributional shifts in LALM outputs than accent cues.

02

Current LALMs reproduce social stereotypes in realistic scenarios.

03

VIBE is easily extensible to new tasks and more representative of real-world biases.

Abstract

Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both offering a fragmented view of fairness. We propose VIBE, a framework that evaluates generative bias through open-ended tasks such as personalized recommendations, using real-world human recordings. Unlike MCQs, our method allows stereotypical associations to manifest organically without predefined options, making it easily extensible to new tasks. Evaluating 11 state-of-the-art LALMs reveals systematic biases in realistic scenarios. We find that gender cues often trigger larger distributional shifts than accent cues, indicating that current LALMs reproduce social stereotypes.
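
To make "distributional shift" concrete, the sketch below compares the categories a model recommends to two speaker groups using Jensen-Shannon divergence. This is only an illustration: the metric, the function names, and the toy labels are assumptions and are not taken from the paper, which may measure shifts differently.

```python
from collections import Counter
from math import log2


def distribution(labels):
    """Turn a list of categorical outputs into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    # Mixture distribution; nonzero wherever p or q is nonzero.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        return sum(a[k] * log2(a[k] / b[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical, hand-made labels standing in for categorized model
# recommendations given the same prompt spoken by two speaker groups.
group_a = ["nurse", "teacher", "nurse", "designer"]
group_b = ["engineer", "engineer", "pilot", "teacher"]

shift = js_divergence(distribution(group_a), distribution(group_b))
print(f"JS divergence between groups: {shift:.3f}")
```

A value near 0 would suggest the model responds similarly regardless of the voice cue, while a value near 1 would indicate almost non-overlapping recommendations. In practice, open-ended outputs would first need to be mapped to categories, a step this sketch assumes has already been done.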
