LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

Amir Ivry; Shinji Watanabe

arXiv:2602.04796 · eess.AS · February 5, 2026

Open Access

TL;DR

This paper introduces LALM-as-a-Judge, a benchmark for evaluating large audio-language models as safety judges in multi-turn spoken dialogues, highlighting their capabilities and limitations in detecting harmful content.

Contribution

It presents the first systematic benchmark and analysis of large audio-language models for safety evaluation in spoken dialogues, including a new dataset and evaluation framework.

Findings

01

Audio and transcription quality significantly affect safety detection performance.

02

Different model architectures and input modalities exhibit trade-offs between sensitivity and stability.

03

Transcription errors can notably reduce the effectiveness of safety judgments.

Abstract

Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 synthetic, unsafe spoken dialogues in English, each consisting of 3-10 turns, in which a single turn contains content from one of 8 harmful categories (e.g., violence) at one of 5 grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs (Qwen2-Audio, Audio Flamingo 3, and MERaLiON) as zero-shot judges that output a…
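The dataset design stated in the abstract (3-10 turns, exactly one unsafe turn, 8 harmful categories, 5 grades) can be pictured as a small metadata record per dialogue. The following is a minimal sketch under stated assumptions, not the authors' generation pipeline: the names `DialogueSpec` and `sample_spec` are hypothetical, categories are represented by placeholder integer ids because the abstract names only some of them, and the uniform sampling is an assumption made purely for illustration.

```python
import random
from dataclasses import dataclass

# Values taken from the abstract.
NUM_CATEGORIES = 8            # 8 harmful categories (only some are named there)
NUM_GRADES = 5                # 5 grades, from very mild to severe
MIN_TURNS, MAX_TURNS = 3, 10  # dialogues consist of 3-10 turns


@dataclass(frozen=True)
class DialogueSpec:
    """Hypothetical metadata for one synthetic unsafe dialogue."""
    n_turns: int      # total number of turns in the dialogue
    unsafe_turn: int  # 0-based index of the single unsafe turn
    category: int     # placeholder id in [0, NUM_CATEGORIES)
    grade: int        # 1 (very mild) .. NUM_GRADES (severe)

    def __post_init__(self):
        # Reject records that fall outside the design stated in the abstract.
        if not MIN_TURNS <= self.n_turns <= MAX_TURNS:
            raise ValueError("n_turns must be between 3 and 10")
        if not 0 <= self.unsafe_turn < self.n_turns:
            raise ValueError("unsafe_turn must index an existing turn")
        if not 0 <= self.category < NUM_CATEGORIES:
            raise ValueError("category id out of range")
        if not 1 <= self.grade <= NUM_GRADES:
            raise ValueError("grade must be between 1 and 5")


def sample_spec(rng: random.Random) -> DialogueSpec:
    """Draw one spec uniformly at random (an assumption, not the paper's method)."""
    n_turns = rng.randint(MIN_TURNS, MAX_TURNS)
    return DialogueSpec(
        n_turns=n_turns,
        unsafe_turn=rng.randrange(n_turns),
        category=rng.randrange(NUM_CATEGORIES),
        grade=rng.randint(1, NUM_GRADES),
    )
```

Calling `sample_spec(random.Random(0))` returns one such record. Where the unsafe turn is placed and how categories and grades are balanced in the actual benchmark are not stated in the abstract, so those choices here are illustrative only.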

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders