ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
Honglei Zhang, Yuting Chen, Chenpeng Hu, Siyue Zhang, Yilei Shi

TL;DR
ReasonAudio is a new benchmark for text-audio retrieval that tests reasoning skills such as negation, temporal order, and duration, and it reveals significant limitations of current models on these tasks.
Contribution
This work presents the first reasoning-intensive benchmark for text-audio retrieval, highlighting the challenges models face in advanced reasoning tasks beyond semantic matching.
Findings
All ten evaluated models perform poorly on the reasoning tasks.
Models struggle most with negation and duration reasoning.
Multimodal LLM-based embeddings do not retain reasoning abilities after fine-tuning.
Abstract
As multimodal content continues to expand at a rapid pace, audio retrieval has emerged as a key enabling technology for media search, content organization, and intelligent assistants. However, most existing benchmarks concentrate on semantic matching and fail to capture the fact that real-world queries often demand advanced reasoning abilities, including negation understanding, temporal ordering, concurrent event recognition, and duration discrimination. To address this gap, we introduce ReasonAudio, the first reasoning-intensive benchmark for Text-Audio Retrieval, comprising 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. Despite their intuitive nature for humans and straightforward construction, these tasks pose significant challenges to current models. Our evaluation of ten state-of-the-art models…
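To make the reasoning categories named in the abstract concrete, the sketch below treats each query type as a predicate over a clip's event timeline. It is illustrative only and is not the benchmark's construction or evaluation code: the `Event` record, the function names, the example labels, and the exact definition chosen for each task are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical annotation of one sound event inside a composite clip."""
    label: str
    start: float  # seconds
    end: float    # seconds


def has(clip, label):
    """True if the clip contains at least one event with this label."""
    return any(e.label == label for e in clip)


def negation(clip, present, absent):
    """Query like 'X but no Y': X occurs and Y never occurs."""
    return has(clip, present) and not has(clip, absent)


def order(clip, first, second):
    """Query like 'X followed by Y': some X ends before some Y starts."""
    return any(
        a.end <= b.start
        for a in clip if a.label == first
        for b in clip if b.label == second
    )


def overlap(clip, x, y):
    """Query like 'X while Y': some X and some Y share a time interval."""
    return any(
        a.start < b.end and b.start < a.end
        for a in clip if a.label == x
        for b in clip if b.label == y
    )


def total_duration(clip, label):
    """Summed length in seconds of all events with this label."""
    return sum(e.end - e.start for e in clip if e.label == label)


def longer(clip, x, y):
    """Query like 'X lasts longer than Y', compared by total duration."""
    return total_duration(clip, x) > total_duration(clip, y)


# A made-up composite clip used only to exercise the predicates.
clip = [
    Event("dog bark", 0.0, 2.0),
    Event("siren", 1.5, 6.0),
    Event("door slam", 7.0, 7.5),
]
```

A retrieval model cannot answer such queries by matching labels alone: a clip containing both a dog bark and a door slam matches "door slam followed by dog bark" semantically, yet fails the `order` predicate.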
