PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio
Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

TL;DR
PolyBench is a new benchmark designed to evaluate the ability of Large Audio Language Models to perform compositional reasoning tasks in polyphonic audio, addressing a gap in existing evaluation methods.
Contribution
This work introduces PolyBench, a comprehensive benchmark with five subsets to assess reasoning over multiple concurrent sound events in polyphonic audio.
Findings
State-of-the-art LALMs show consistent performance degradation on polyphonic audio.
The degradation points to a fundamental bottleneck in current LALMs for polyphonic reasoning.
The benchmark covers counting, classification, detection, concurrency, and duration estimation.
Abstract
Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio. However, existing benchmarks provide limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. In this work, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio. PolyBench comprises five evaluation subsets covering counting, classification, detection, concurrency, and duration estimation, requiring reasoning over multiple concurrent events and their relations. Evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic audio, indicating a fundamental bottleneck in current LALMs.
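Since the benchmark is organized into five subsets, results are most naturally reported per subset. The following is a minimal sketch of such a breakdown, not the authors' evaluation code: the record schema (`subset`, `prediction`, `answer`), the short subset labels, and the exact-match scoring rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Subset labels follow the five task types named in the abstract;
# the exact strings are hypothetical.
SUBSETS = ("counting", "classification", "detection", "concurrency", "duration")


def per_subset_accuracy(records):
    """Return {subset: accuracy} for a list of result records.

    Each record is assumed to be a dict with the hypothetical keys
    "subset", "prediction", and "answer" (all strings).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        subset = record["subset"]
        if subset not in SUBSETS:
            raise ValueError(f"unknown subset: {subset!r}")
        total[subset] += 1
        # Assumed scoring rule: case-insensitive exact match after trimming.
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[subset] += 1
    # Report only subsets that actually have examples.
    return {s: correct[s] / total[s] for s in SUBSETS if total[s] > 0}


if __name__ == "__main__":
    demo = [
        {"subset": "counting", "prediction": "3", "answer": "3"},
        {"subset": "counting", "prediction": "2", "answer": "3"},
        {"subset": "detection", "prediction": " Yes ", "answer": "yes"},
    ]
    print(per_subset_accuracy(demo))
```

Keeping the scores separate per subset, rather than pooling them into one number, makes it possible to see which kind of polyphonic reasoning a model struggles with.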
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
