ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

Honglei Zhang; Yuting Chen; Chenpeng Hu; Siyue Zhang; Yilei Shi

arXiv:2605.03361·cs.AI·May 7, 2026

TL;DR

ReasonAudio is a benchmark for text-audio retrieval that tests reasoning skills such as negation, temporal order, and duration, and it reveals significant limitations of current models on these tasks.

Contribution

This work presents the first reasoning-intensive benchmark for text-audio retrieval, highlighting the challenges models face in advanced reasoning tasks beyond semantic matching.

Findings

01 All evaluated models perform poorly on reasoning tasks.

02 Models struggle most with negation and duration reasoning.

03 Multimodal LLM-based embeddings do not retain reasoning abilities after fine-tuning.

Abstract

As multimodal content continues to expand at a rapid pace, audio retrieval has emerged as a key enabling technology for media search, content organization, and intelligent assistants. However, most existing benchmarks concentrate on semantic matching and fail to capture the fact that real-world queries often demand advanced reasoning abilities, including negation understanding, temporal ordering, concurrent event recognition, and duration discrimination. To address this gap, we introduce ReasonAudio, the first reasoning-intensive benchmark for Text-Audio Retrieval, comprising 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. Despite their intuitive nature for humans and straightforward construction, these tasks pose significant challenges to current models. Our evaluation of ten state-of-the-art models…
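
To make the distinction between semantic matching and reasoning concrete, here is a minimal Python sketch of what four of the named tasks ask of a retriever. It is an illustration only: the `Event` record, the predicate functions, and the example labels are all hypothetical and are not taken from the benchmark's data format or construction pipeline. Each clip is assumed to be a list of labeled sound events with onset and offset times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical annotation for one sound event in a composite clip."""
    label: str    # event name, e.g. "dog_bark"
    start: float  # onset in seconds
    end: float    # offset in seconds


def has(clip, label):
    """True if the clip contains at least one event with this label."""
    return any(e.label == label for e in clip)


def negation(clip, present, absent):
    """Query like: 'a <present> without any <absent>'."""
    return has(clip, present) and not has(clip, absent)


def order(clip, first, second):
    """Query like: '<first> followed by <second>' (some first ends before some second starts)."""
    return any(a.end <= b.start
               for a in clip if a.label == first
               for b in clip if b.label == second)


def overlap(clip, x, y):
    """Query like: '<x> while <y>' (the two events share some time interval)."""
    return any(a.start < b.end and b.start < a.end
               for a in clip if a.label == x
               for b in clip if b.label == y)


def duration(clip, longer, shorter):
    """Query like: '<longer> lasts longer than <shorter>' (by total event time)."""
    def total(label):
        return sum(e.end - e.start for e in clip if e.label == label)
    return has(clip, longer) and has(clip, shorter) and total(longer) > total(shorter)


# Two clips with the SAME set of labels but different temporal structure.
clip_a = [Event("dog_bark", 0.0, 2.0), Event("siren", 3.0, 8.0)]
clip_b = [Event("siren", 0.0, 2.0), Event("dog_bark", 1.0, 6.0)]

# A matcher that only checks which labels are present cannot tell these apart.
same_labels = {e.label for e in clip_a} == {e.label for e in clip_b}
```

Under these assumptions, the query "a dog bark followed by a siren" is satisfied by `clip_a` but not by `clip_b`, while "a dog bark while a siren sounds" is satisfied only by `clip_b`, even though both clips contain exactly the same event labels.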
