AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations
David Xu

TL;DR
AudioSetMix introduces a scalable, LLM-assisted data augmentation method that enhances audio-language datasets, leading to improved model performance and addressing data quality and diversity limitations in the domain.
Contribution
The paper presents a novel LLM-based augmentation technique to generate high-quality, diverse audio-caption pairs, significantly improving dataset size and quality for audio-language learning.
Findings
Improved model performance on multiple benchmarks.
Addresses lack of modifiers in existing datasets.
Achieves state-of-the-art results with augmented data.
Abstract
Multi-modal learning in the audio-language domain has seen significant advancements in recent years. However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks. Existing audio-language datasets are notably smaller, and manual labeling is hindered by the need to listen to entire audio clips for accurate labeling. Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations. Leveraging a Large Language Model, we generate descriptions of augmented audio clips with a prompt template. This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio-related models. Integration of our dataset improves model performance on benchmarks by providing diversified and better-aligned examples. Notably, our…
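The pipeline the abstract describes (apply a signal processing operation to labeled clips, then fill a prompt template for an LLM) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the specific operations (volume change and mixing), the modifier wording, and the template text are all assumptions made for the example, and the LLM call itself is left out.

```python
from itertools import zip_longest

# Hypothetical template; the paper's actual prompt is not given in this summary.
PROMPT_TEMPLATE = (
    "Write one caption for an audio clip built as follows.\n"
    "Sound 1: {label_a} ({modifier_a}).\n"
    "Sound 2: {label_b} ({modifier_b}).\n"
    "Combination: {relation}."
)


def change_volume(samples, gain_db):
    """Scale every sample by a gain expressed in decibels."""
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in samples]


def mix(samples_a, samples_b):
    """Overlay two clips sample by sample, zero-padding the shorter one."""
    return [a + b for a, b in zip_longest(samples_a, samples_b, fillvalue=0.0)]


def build_pair(samples_a, label_a, samples_b, label_b, gain_db=-12.0):
    """Return (augmented audio, prompt text) for one synthetic training example.

    The second clip is attenuated and mixed under the first, and the same
    operations are recorded as natural-language modifiers in the prompt.
    """
    quieter_b = change_volume(samples_b, gain_db)
    audio = mix(samples_a, quieter_b)
    prompt = PROMPT_TEMPLATE.format(
        label_a=label_a,
        modifier_a="unchanged",
        label_b=label_b,
        modifier_b="quieter",
        relation="played at the same time",
    )
    return audio, prompt
```

In a sketch like this, the prompt would be sent to a language model and the returned caption paired with the augmented audio. Because the modifiers in the prompt are derived from the operations actually applied, the caption stays aligned with the signal by construction.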
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Diverse Musicological Studies
