LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues
Amir Ivry, Shinji Watanabe

TL;DR
This paper introduces LALM-as-a-Judge, a benchmark for evaluating large audio-language models as safety judges in multi-turn spoken dialogues, highlighting their capabilities and limitations in detecting harmful content.
Contribution
It presents the first systematic benchmark and analysis of large audio-language models for safety evaluation in spoken dialogues, including a new dataset and evaluation framework.
Findings
Audio and transcription quality significantly affect safety detection performance.
Different model architectures and input modalities exhibit trade-offs between sensitivity and stability.
Transcription errors can notably reduce the effectiveness of safety judgments.
Abstract
Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 synthetic unsafe spoken dialogues in English, each consisting of 3-10 turns, in which a single turn contains content from one of 8 harmful categories (e.g., violence) at one of 5 severity grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs (Qwen2-Audio, Audio Flamingo 3, and MERaLiON) as zero-shot judges that output a…
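The dataset design described in the abstract (3-10 turns, exactly one unsafe turn, one of 8 categories, one of 5 grades) can be illustrated with a minimal Python sketch. This is not the authors' generation pipeline: the names DialogueSpec and sample_spec are hypothetical, the abstract names only violence, harassment, and hate, so the other five category labels are placeholders, and uniform random sampling is an assumption.

```python
import random
from dataclasses import dataclass

# Only the first three labels come from the abstract; the remaining five
# are placeholders because the abstract does not list them.
CATEGORIES = ["violence", "harassment", "hate"] + [
    f"category_{i}" for i in range(4, 9)
]
GRADES = [1, 2, 3, 4, 5]        # assumed encoding: 1 = very mild, 5 = severe
MIN_TURNS, MAX_TURNS = 3, 10    # dialogue length range stated in the abstract


@dataclass(frozen=True)
class DialogueSpec:
    """Hypothetical description of one synthetic dialogue to generate."""
    category: str
    grade: int
    num_turns: int
    unsafe_turn: int  # 0-based index of the single unsafe turn


def sample_spec(rng: random.Random) -> DialogueSpec:
    """Draw one spec uniformly at random (the sampling scheme is an assumption)."""
    num_turns = rng.randint(MIN_TURNS, MAX_TURNS)  # inclusive on both ends
    return DialogueSpec(
        category=rng.choice(CATEGORIES),
        grade=rng.choice(GRADES),
        num_turns=num_turns,
        unsafe_turn=rng.randrange(num_turns),
    )


rng = random.Random(0)
specs = [sample_spec(rng) for _ in range(5)]
for spec in specs:
    print(spec)
```

If the 24,000 dialogues were spread evenly over the 8 x 5 category-grade grid, each cell would hold 600 dialogues; the abstract does not state whether the split is balanced.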
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders
