ADIFF: Explaining audio difference using natural language
Soham Deshmukh, Shuo Han, Rita Singh, Bhiksha Raj

TL;DR
This paper introduces ADIFF, a novel approach for explaining audio differences using natural language, supported by new datasets, baseline models, and comprehensive evaluations, advancing the interpretability of audio analysis.
Contribution
The paper is the first to systematically study audio difference explanation, proposing new datasets, a novel model architecture, and benchmarks for the task.
Findings
The baseline struggles with similar-sounding audio and with detailed explanations.
The proposed ADIFF model improves explanation quality and detail.
With its enhancements, ADIFF outperforms existing audio-language models.
Abstract
Understanding and explaining differences between audio recordings is crucial for fields like audio forensics, quality assessment, and audio generation. This involves identifying and describing audio events, acoustic scenes, signal characteristics, and their emotional impact on listeners. This paper stands out as the first work to comprehensively study the task of explaining audio differences and to propose a benchmark and baselines for the task. First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. Using Large Language Models (LLMs), we generate three levels of difference explanations: (1) concise descriptions of audio events and objects, (2) brief sentences about audio events, acoustic scenes, and signal properties, and (3) comprehensive explanations that include semantics and listener emotions. For the…
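The three explanation levels described in the abstract can be pictured as three prompt variants built from a pair of human-written captions. The sketch below is illustrative only: `CaptionPair`, `TIER_INSTRUCTIONS`, and `build_difference_prompt` are hypothetical names, the prompt wording is an assumption rather than the prompts used by the authors, and no LLM is actually called.

```python
from dataclasses import dataclass

# Tier descriptions paraphrased from the abstract. The real prompt wording
# used to build the datasets is not given here, so this text is an assumption.
TIER_INSTRUCTIONS = {
    1: "a concise description of the audio events and objects that differ",
    2: "a brief sentence covering audio events, acoustic scenes, and signal properties",
    3: "a comprehensive explanation that includes semantics and listener emotions",
}


@dataclass(frozen=True)
class CaptionPair:
    """Hypothetical container for the human captions of two recordings."""
    caption_a: str
    caption_b: str


def build_difference_prompt(pair: CaptionPair, tier: int) -> str:
    """Return a text prompt asking for a tier-specific difference explanation."""
    if tier not in TIER_INSTRUCTIONS:
        raise ValueError(f"tier must be 1, 2, or 3, got {tier!r}")
    return (
        "You are given human-written captions for two audio recordings.\n"
        f"Audio 1: {pair.caption_a}\n"
        f"Audio 2: {pair.caption_b}\n"
        "Explain the difference between the two recordings as "
        f"{TIER_INSTRUCTIONS[tier]}."
    )


if __name__ == "__main__":
    pair = CaptionPair(
        caption_a="A dog barks while rain falls.",
        caption_b="A dog barks in a quiet room.",
    )
    # Print one prompt per tier; each would be sent to an LLM of your choice.
    for tier in (1, 2, 3):
        print(build_difference_prompt(pair, tier))
        print()
```

Keeping the tier text in a single mapping makes it easy to generate all three explanation levels for the same caption pair, which is the property the tiered datasets rely on.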
Peer Reviews
Decision: ICLR 2025 Spotlight
- Introduction of a new task: The paper addresses a previously underexplored area by defining the audio exploration task, and it motivates this task well.
- Dataset creation and availability: The authors took great care in creating the new datasets ACD and CLD for this task, which is a significant contribution for motivating further research on it.
- Novelty in model architecture: The authors build upon existing literature to introduce a new cross projection layer that helps com
- Absence of a concrete baseline: The baseline the authors compare ADIFF against is a clearly inferior version of the same model, which guarantees that ADIFF performs better than it. This baseline is more of an ablation of the model's components.
- A major contribution of this work is the "cross projection" layer that distinguishes ADIFF from the baseline and existing literature. However, there is insufficient evidence of the importance of the cross projection layer due to two fa
The paper has three strengths:
1. The paper introduces considerable novelty by defining a new audio task inspired by the development of multimodal language models. The approach to explaining audio differences effectively enhances the LLM's ability to understand audio inputs at both the semantic and acoustic levels. Additionally, the method incorporates primitives of human perception into the dataset creation process, progressively decomposing audio elements from audio events to acoustic
The paper has one weakness:
1. The presentation of the experimental section requires further refinement. The experimental tables are somewhat difficult to follow, as the model names are not consistently listed in most of them. Additionally, highlighting or bolding the best values for each metric would enhance clarity and emphasize the differences more effectively.
The paper proposes a valid methodology in generating "audio difference" captions through LLMs using human captions; the methodology is verified through experiments.
1. The three-tier captioning process seems arbitrary; no ablation study is conducted for it (i.e., if trained only on tier 3, can the models perform well on tier 1? If the tiers are purely hierarchical, then training on tier 3 should in theory yield good results on lower tiers).
2. The captioning quality of both AudioCaps and Clotho is questionable; WavCaps and other larger caption datasets contain more diversity, yet no experiment is conducted on them.
3. The SOTA model, Qwen-Audio, seems to not
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing
