Do Audio-Language Models Understand Linguistic Variations?
Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha

TL;DR
This paper investigates the limitations of current audio-language models in handling linguistic variations and introduces RobustCLAP, a new training method that improves robustness and retrieval performance by using multi-view contrastive learning with paraphrases.
Contribution
The paper presents RobustCLAP, a novel, efficient training technique that enhances audio-language models' ability to generalize across linguistic variations using multi-view contrastive learning.
Findings
RobustCLAP improves CLAP's text-to-audio retrieval performance by 0.8%-13% across benchmarks.
RobustCLAP enhances robustness to linguistic variations in textual queries.
Controlled experiments show that existing ALMs struggle to generalize to linguistic variations.
Abstract
Open-vocabulary audio language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio-text retrieval using natural language queries. In this paper, for the first time, we perform controlled experiments on various benchmarks to show that existing ALMs struggle to generalize to linguistic variations in textual queries. To address this issue, we propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations agnostic to linguistic variations. Specifically, we reformulate the contrastive loss used in CLAP architectures by introducing a multi-view contrastive learning objective, in which paraphrases are treated as different views of the same audio scene and used for training. Our proposed approach improves the text-to-audio retrieval performance of CLAP by 0.8%-13% across benchmarks and enhances…
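The multi-view objective described in the abstract can be illustrated with a small sketch. The abstract does not give the exact loss, so what follows is one plausible reading, not the authors' formulation: a symmetric CLAP-style contrastive loss is computed between the audio embeddings and each text view (the original captions and one or more paraphrase sets), and the per-view losses are averaged. The function names, the temperature value, and the averaging scheme are all assumptions made for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i]."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [sum(a * b for a, b in zip(query, key)) / temperature
                  for key in keys]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(queries)


def symmetric_contrastive_loss(audio, text, temperature=0.07):
    """CLAP-style loss: average of audio-to-text and text-to-audio terms."""
    return 0.5 * (info_nce(audio, text, temperature)
                  + info_nce(text, audio, temperature))


def multi_view_loss(audio, text_views, temperature=0.07):
    """Average the symmetric loss over every text view of the same audio.

    text_views is a list of caption sets (assumed layout): the original
    captions plus paraphrase sets, each aligned index-by-index with audio.
    """
    audio = [normalize(a) for a in audio]
    losses = []
    for view in text_views:
        view = [normalize(t) for t in view]
        losses.append(symmetric_contrastive_loss(audio, view, temperature))
    return sum(losses) / len(losses)


# Toy embeddings (made up for illustration): two audio clips, their
# captions, and one paraphrase per caption.
audio = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]
paraphrases = [[0.8, 0.2], [0.2, 0.8]]

aligned = multi_view_loss(audio, [captions, paraphrases])
misaligned = multi_view_loss(audio, [captions, paraphrases[::-1]])
print(f"aligned paraphrases:    {aligned:.4f}")
print(f"misaligned paraphrases: {misaligned:.4f}")
```

In this toy setup the loss is lower when each paraphrase embedding sits close to its own audio clip than when the paraphrases are swapped, which is the pressure that would push a text encoder toward mapping differently worded descriptions of the same scene to nearby points.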
Taxonomy
Topics: Music and Audio Processing
Methods: Contrastive Learning
