MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Zhi Rui Tam, Yun-Nung Chen

TL;DR
Audio-based large language models used for clinical decision-making give different recommendations depending on the input modality and on voice characteristics such as the speaker's age and gender, undermining decision consistency and risking healthcare disparities.
Contribution
A controlled evaluation of audio LLM bias in clinical decision-making: 170 clinical cases, each synthesized into speech from 36 voice profiles varying in age, gender, and emotion, and compared against identical text-based inputs.
Findings
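Modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations.
Age disparities between young and elderly voices reached up to 12% and persisted in most models despite chain-of-thought prompting.
Explicit reasoning eliminated gender bias; the impact of emotion was not detected because the models recognized emotion poorly.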
Abstract
As large language models transition from text-based interfaces to audio interactions in clinical settings, they might introduce new vulnerabilities through paralinguistic cues in audio. We evaluated these models on 170 clinical cases, each synthesized into speech from 36 distinct voice profiles spanning variations in age, gender, and emotion. Our findings reveal a severe modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations. Further analysis uncovered age disparities of up to 12% between young and elderly voices, which persisted in most models despite chain-of-thought prompting. While explicit reasoning successfully eliminated gender bias, the impact of emotion was not detected due to poor recognition performance. These results demonstrate that audio LLMs are…
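The comparisons in the abstract (audio vs. text, young vs. elderly voices) can be read as differences in recommendation rate between groups of otherwise identical cases. The following is a minimal sketch of that kind of calculation, not the authors' evaluation code: the function names, the record layout, and the toy data are hypothetical, and the abstract does not say whether the reported percentages are absolute or relative differences, so the sketch assumes absolute differences in rate.

```python
from collections import defaultdict


def recommendation_rate(decisions):
    """Fraction of decisions that are positive (e.g. surgery recommended)."""
    decisions = list(decisions)
    return sum(decisions) / len(decisions) if decisions else 0.0


def rate_gap_by_group(records, baseline_group):
    """Absolute difference in recommendation rate of each group vs. a baseline.

    records: iterable of (group_label, recommended) pairs, where group_label is
    a hypothetical label such as "text", "audio", "young" or "elderly".
    """
    by_group = defaultdict(list)
    for group, recommended in records:
        by_group[group].append(bool(recommended))
    base = recommendation_rate(by_group[baseline_group])
    return {
        group: recommendation_rate(decisions) - base
        for group, decisions in by_group.items()
        if group != baseline_group
    }


# Toy data (invented for illustration only, not results from the paper).
records = [
    ("text", True), ("text", True), ("text", False), ("text", True),
    ("audio", True), ("audio", False), ("audio", False), ("audio", False),
]
gaps = rate_gap_by_group(records, baseline_group="text")
print(gaps)

# Number of synthesized clips implied by the study design in the abstract:
# 170 clinical cases, each rendered with 36 voice profiles.
n_clips = 170 * 36
print(n_clips)
```

In the toy data the text rate is 0.75 and the audio rate is 0.25, so the gap for audio is -0.5. The same function applies to age or gender groups by changing the labels and the baseline.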
Taxonomy
Topics: Voice and Speech Disorders · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
