CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

Sreyan Ghosh; Ashish Seth; Sonal Kumar; Utkarsh Tyagi; Chandra Kiran Evuru; S. Ramaneswaran; S. Sakshi; Oriol Nieto; Ramani Duraiswami; Dinesh Manocha

arXiv:2310.08753 · cs.SD · August 1, 2024 · 1 citation

TL;DR

This paper introduces CompA, a collection of two benchmarks for evaluating compositional reasoning in audio-language models (ALMs), and proposes CompA-CLAP, a fine-tuned model that significantly improves compositional reasoning on these benchmarks.

Contribution

The paper presents novel benchmarks for assessing compositional reasoning in ALMs and a new training method that enhances model performance on these tasks.

Findings

1. Current ALMs perform only marginally better than random chance on compositional reasoning.

2. CompA-CLAP significantly outperforms baseline models on the CompA benchmarks.

3. Proposed training improvements enable better understanding of acoustic event order and attributes.

Abstract

A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two…
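To make the "contrastive approach" and "zero-shot audio classification" mentioned in the abstract concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive objective over paired audio and text embeddings, plus zero-shot classification by cosine similarity. It is an illustration under stated assumptions, not the CLAP or CompA-CLAP implementation: the function names, the toy 2-dimensional embeddings, and the temperature value of 0.07 are all hypothetical, and real models would produce the embeddings with learned audio and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity_matrix(audio_embs, text_embs):
    """Entry [i][j] is the cosine similarity of audio i and text j."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(av, tv)) for tv in t] for av in a]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    Pair i (audio_embs[i], text_embs[i]) is treated as the positive;
    every other pairing in the batch is a negative. The temperature
    is an illustrative value, not one taken from the paper.
    """
    sims = cosine_similarity_matrix(audio_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def zero_shot_classify(audio_emb, label_embs):
    """Return the index of the label embedding closest to the audio embedding."""
    sims = cosine_similarity_matrix([audio_emb], label_embs)[0]
    return max(range(len(sims)), key=lambda j: sims[j])


# Toy embeddings standing in for encoder outputs (hypothetical values).
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(audio, text_aligned)
loss_swapped = symmetric_contrastive_loss(audio, text_swapped)
predicted_label = zero_shot_classify([0.9, 0.1], text_aligned)
```

In this toy setup the loss is close to zero when each audio embedding matches its own caption and large when the captions are swapped. An objective of this kind rewards matching whole audio clips to whole captions, which is one plausible reason compositional properties such as event order may not be captured, though the abstract itself only states that this ability remains largely unexplored.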

Videos

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models (SlidesLive)

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis