Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
Soham Deshmukh, Shuo Han, Hazim Bukhari, Benjamin Elizalde, Hannes Gamper, Rita Singh, Bhiksha Raj

TL;DR
This paper introduces the task of Audio Entailment to evaluate the deductive reasoning ability of Audio-Language Models, revealing their limitations and proposing a captioning-based intermediate step to improve reasoning performance.
Contribution
The paper defines a new benchmark for logical reasoning in audio understanding and demonstrates how captioning can enhance ALMs' reasoning capabilities.
Findings
ALMs show deficiencies in logical reasoning tasks.
Having the model caption the audio first and then reason over that caption ("caption-before-reason") improves reasoning performance.
Benchmark datasets reveal reasoning limitations in current models.
Abstract
Recent literature uses language to build foundation models for audio. These Audio-Language Models (ALMs) are trained on a vast number of audio-text pairs and show remarkable performance in tasks including Text-to-Audio Retrieval, Captioning, and Question Answering. However, their ability to engage in more complex open-ended tasks, like Interactive Question-Answering, requires proficiency in logical reasoning -- a skill not yet benchmarked. We introduce the novel task of Audio Entailment to evaluate an ALM's deductive reasoning ability. This task assesses whether a text description (hypothesis) of audio content can be deduced from an audio recording (premise), with potential conclusions being entailment, neutral, or contradiction, depending on the sufficiency of the evidence. We create two datasets for this task with audio recordings sourced from two audio captioning datasets --…
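The task setup described in the abstract, together with the caption-before-reason idea from the findings, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `EntailmentExample`, `caption_before_reason`, `accuracy`, and the `captioner` and `reasoner` callables are hypothetical names standing in for whatever models and prompts the paper actually uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Label(Enum):
    """The three possible conclusions named in the abstract."""
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class EntailmentExample:
    """Hypothetical record layout for one benchmark item (an assumption)."""
    audio_path: str   # premise: the audio recording
    hypothesis: str   # text description to test against the audio
    label: Label      # gold conclusion


def caption_before_reason(
    audio_path: str,
    hypothesis: str,
    captioner: Callable[[str], str],
    reasoner: Callable[[str, str], Label],
) -> Label:
    """Caption the audio first, then reason over the caption and hypothesis.

    Both callables are placeholders; the source does not specify them.
    """
    caption = captioner(audio_path)
    return reasoner(caption, hypothesis)


def accuracy(
    examples: Iterable[EntailmentExample],
    predict: Callable[[str, str], Label],
) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    items = list(examples)
    if not items:
        return 0.0
    correct = sum(predict(ex.audio_path, ex.hypothesis) == ex.label for ex in items)
    return correct / len(items)
```

Keeping the captioner and reasoner as plain callables means the same scoring loop can compare a direct audio-to-label model against a caption-first pipeline without changing the evaluation code.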
Taxonomy
Topics: Subtitles and Audiovisual Media
