BLAB: Brutally Long Audio Bench
Orevaoghene Ahia, Martijn Bartelds, Kabir Ahuja, Hila Gonen, Valentin Hofmann, Siddhant Arora, Shuyue Stella Li, Vishal Puttagunta, Mofetoluwa Adeyemi, Charishma Buchireddy, Ben Walls, Noah Bennett, Shinji Watanabe, Noah A. Smith, Yulia Tsvetkov, Sachin Kumar

TL;DR
BLAB is a long-form audio benchmark that evaluates audio language models on real-world, lengthy speech segments, revealing that current models struggle with long-duration understanding and complex tasks.
Contribution
The paper presents BLAB, a challenging new benchmark with over 833 hours of long-form audio, designed to evaluate and improve long-form audio understanding in language models.
Findings
All evaluated models struggle with long audio tasks.
Performance declines as audio duration increases.
Models rely more on prompts than audio content for understanding.
Abstract
Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions…
Peer Reviews
Decision·Submitted to ICLR 2026
- Truly long speech (≈50–60 min avg per item) where current LAMs struggle, including controlled noise/silence robustness and position-sensitivity tests.
- Clear, reproducible prompt formats and metric definitions per task family.
- Doesn't evaluate many recent models like Gemini 2.5, GPT-4o-audio.
- Doesn't evaluate recent models like Audio Flamingo 3, which claims long audio understanding.
- NE & Ad localization are derived by running text-only LMs on transcripts, then mapping spans back via WhisperX timestamps. This makes many items solvable from text alone and entangles evaluation quality with ASR/FA errors rather than acoustic understanding.
- Word timestamps over 191 hours are WhisperX with just ~1% correctio
**Strength:** The background of the limitations of current audio language models and the motivations are clearly and logically stated and emphasized throughout the paper. Each category of the reasoning capabilities and the tasks under it are clearly designed and described in detail. The experimental setup is designed carefully and fairly.
**Weakness:** The description of the tasks contained in BLAB and the 4 reasoning skills can be clearer. For instance, after reading the abstract section, and up to the point 'across eight tasks and evaluates four fundamental reasoning skills', the relationship between the eight tasks and the four reasoning skills can be a little confusing. It would be better if you could describe that the 8 tasks you are evaluating are under 4 categories, just like your caption for figure 1, in the abstract, to
- A Novel Benchmark: The paper's primary contribution is the BLAB benchmark itself. It addresses a critical, well-documented gap in the field: the lack of evaluation for audio-grounded reasoning on long-form content (averaging 51 minutes). This moves the community beyond short-clip evaluations to a more realistic and challenging domain.
- Transparent Data Collection Pipeline: The authors detail a rigorous data pipeline, using permissively licensed sources from YouTube. This process is strengthen
- The evaluation of the full, long-form benchmark (BLAB) relies almost entirely on closed-source, proprietary models like Gemini. While this effectively demonstrates the benchmark's difficulty, it limits reproducibility and prevents the research community from conducting a deeper analysis of the models' failure modes. The open-source models were only evaluated on the BLAB-MINI subset, so their long-context audio capabilities remain unevaluated.
- The design of the word localization task seems co
