BLAB: Brutally Long Audio Bench

Orevaoghene Ahia; Martijn Bartelds; Kabir Ahuja; Hila Gonen; Valentin Hofmann; Siddhant Arora; Shuyue Stella Li; Vishal Puttagunta; Mofetoluwa Adeyemi; Charishma Buchireddy; Ben Walls; Noah Bennett; Shinji Watanabe; Noah A. Smith; Yulia Tsvetkov; Sachin Kumar

arXiv:2505.03054·cs.AI·May 14, 2025

TL;DR

BLAB is a long-form audio benchmark that evaluates audio language models on real-world, lengthy speech segments. It reveals that current models struggle with long-duration understanding and complex tasks.

Contribution

The paper presents BLAB, a new challenging benchmark with over 833 hours of long audio, to evaluate and improve long-form audio understanding in language models.

Findings

01

All evaluated models struggle with long audio tasks.

02

Performance declines as audio duration increases.

03

Models rely more on prompts than audio content for understanding.

Abstract

Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 5

Strengths

- Truly long speech (≈50–60 min avg per item) where current LAMs struggle, including controlled noise/silence robustness and position-sensitivity tests.
- Clear, reproducible prompt formats and metric definitions per task family.

Weaknesses

- Doesn't evaluate many recent models like Gemini 2.5, GPT-4o-audio.
- Doesn't evaluate recent models like Audio Flamingo 3, which claims long audio understanding.
- NE & Ad localization are derived by running text-only LMs on transcripts, then mapping spans back via WhisperX timestamps. This makes many items solvable from text alone and entangles evaluation quality with ASR/FA errors rather than acoustic understanding.
- Word timestamps over 191 hours are WhisperX with just ~1% correctio
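
The transcript-to-audio mapping this review refers to can be illustrated with a minimal sketch. This is not the authors' pipeline: the `(word, start_sec, end_sec)` tuple format, the `normalize` helper, and the `locate_span` function are all hypothetical, and the timestamps are made up for illustration. The sketch only shows the general idea of taking a text span chosen from a transcript and recovering its time interval from word-level timestamps.

```python
from typing import List, Optional, Tuple

# Hypothetical word-level timestamp record: (word, start_sec, end_sec).
# Real aligner output is richer; this simplified shape is an assumption.
Word = Tuple[str, float, float]


def normalize(token: str) -> str:
    """Lowercase and strip surrounding punctuation so tokens compare equal."""
    return token.lower().strip(".,!?;:\"'()")


def locate_span(words: List[Word], phrase: str) -> Optional[Tuple[float, float]]:
    """Return (start_sec, end_sec) of the first occurrence of `phrase`, or None."""
    target = [t for t in (normalize(p) for p in phrase.split()) if t]
    if not target:
        return None
    tokens = [normalize(w) for w, _, _ in words]
    n = len(target)
    # Slide a window over the transcript and compare token by token.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            # Start of the first matched word, end of the last matched word.
            return words[i][1], words[i + n - 1][2]
    return None


# Made-up timestamps for a five-word utterance.
words = [
    ("Welcome", 0.00, 0.42),
    ("to", 0.42, 0.55),
    ("New", 0.55, 0.80),
    ("York,", 0.80, 1.20),
    ("everyone.", 1.25, 1.90),
]
span = locate_span(words, "New York")
```

In a scheme like this the returned interval is only as accurate as the word timestamps it is read from, which is why the review ties annotation quality to ASR and alignment errors.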

Reviewer 02·Rating 6·Confidence 4

Strengths

**Strength:** The background of the limitations of current audio language models and the motivations are clearly and logically stated and emphasized throughout the paper. Each category of the reasoning capabilities and the tasks under it are clearly designed and described in detail. The experimental setup is designed carefully and fairly.

Weaknesses

**Weakness:** The description of the tasks contained in BLAB and the 4 reasoning skills can be clearer. For instance, after reading the abstract section, and up to the point 'across eight tasks and evaluates four fundamental reasoning skills', the relationship between the eight tasks and the four reasoning skills can be a little confusing. It would be better if you could describe that the 8 tasks you are evaluating are under 4 categories, just like your caption for figure 1, in the abstract, to

Reviewer 03·Rating 6·Confidence 3

Strengths

- A Novel Benchmark: The paper's primary contribution is the BLAB benchmark itself. It addresses a critical, well-documented gap in the field: the lack of evaluation for audio-grounded reasoning on long-form content (averaging 51 minutes). This moves the community beyond short-clip evaluations to a more realistic and challenging domain.
- Transparent Data Collection Pipeline: The authors detail a rigorous data pipeline, using permissively licensed sources from YouTube. This process is strengthen

Weaknesses

- The evaluation of the full, long-form benchmark (BLAB) relies almost entirely on closed-source, proprietary models like Gemini. While this effectively demonstrates the benchmark's difficulty, it limits reproducibility and prevents the research community from conducting a deeper analysis of the models' failure modes. The open-source models were only evaluated on the BLAB-MINI subset, so their long-context audio capabilities remain unevaluated.
- The design of the word localization task seems co

Code & Models

Datasets

oreva/blab_long_audio
dataset· 193 dl
