CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

TL;DR
This paper introduces CompA, a pair of benchmarks for evaluating compositional reasoning in audio-language models, and proposes CompA-CLAP, a fine-tuned model that significantly improves compositional reasoning on these benchmarks.
Contribution
The paper presents novel benchmarks for assessing compositional reasoning in audio-language models (ALMs) and a new training method that improves model performance on these tasks.
Findings
Current ALMs perform only marginally better than random chance on compositional reasoning.
CompA-CLAP significantly outperforms baseline models on the CompA benchmarks.
The proposed training improvements help the model better capture the order of acoustic events and the attributes bound to them.
Abstract
A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two…
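To make the "contrastive approach" mentioned in the abstract concrete, the following is a minimal sketch of a generic symmetric contrastive objective over paired audio and text embeddings. It is not the CLAP or CompA-CLAP training code. The function names, the toy three-dimensional embeddings, and the temperature of 0.07 are all illustrative assumptions; real models obtain the embeddings from learned audio and text encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Generic audio-to-text plus text-to-audio contrastive loss.

    Row i of audio_embs is assumed to be paired with row i of text_embs.
    The temperature value is an illustrative assumption.
    """
    a = [normalize(v) for v in audio_embs]
    t = [normalize(v) for v in text_embs]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(av, tv)) / temperature for tv in t]
              for av in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(transposed))


# Hypothetical toy embeddings standing in for encoder outputs.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
mismatched = [matched[1], matched[2], matched[0]]  # every caption misassigned

loss_matched = symmetric_contrastive_loss(audio, matched)
loss_mismatched = symmetric_contrastive_loss(audio, mismatched)
print(loss_matched < loss_mismatched)
```

The loss is low when each audio embedding is closest to its own caption and high when captions are shuffled. An objective of this kind rewards matching an audio clip to the right caption as a whole, which is one way to see why the abstract treats compositional reasoning as a separate question that needs its own evaluation.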
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis
