DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Baihan Li; Zeyu Xie; Xuenan Xu; Yiwei Guo; Ming Yan; Ji Zhang; Kai Yu; Mengyue Wu

arXiv:2407.13198·cs.SD·July 19, 2024

Open Access

TL;DR

DiveSound introduces a framework using large language models and multimodal data to systematically construct diverse audio datasets, significantly improving diversity in audio generation tasks.

Contribution

We propose DiveSound, a scalable, autonomous framework leveraging multimodal contrastive representations for systematic sound class diversity construction with LLM assistance.

Findings

01 Enhanced diversity in audio generation with visual guidance

02 Constructed a multimodal dataset with detailed sound class subcategories

03 Demonstrated substantial diversity improvements in text-to-audio tasks

Abstract

Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assisted by large language models. As both textual and visual information can be utilized to guide diverse generation, DiveSound leverages multimodal contrastive representations in data construction. Our framework is highly autonomous and can be easily scaled up. We provide a text-audio-image aligned diversity dataset whose sound event class tags have an average of 2.42 subcategories. Text-to-audio experiments on the constructed dataset show a substantial increase of diversity with the help of the…
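To make the "average number of subcategories per class tag" statistic concrete, here is a minimal sketch of how such a figure could be computed from an in-class taxonomy. The class names, subcategory names, and the `average_subcategories` helper are all hypothetical and chosen for illustration only; they are not taken from the DiveSound dataset, and the abstract does not describe how the reported 2.42 was computed.

```python
# Hypothetical toy taxonomy: sound class tag -> list of subcategories.
# These entries are invented for illustration and are NOT from DiveSound.
taxonomy = {
    "dog bark": ["small dog", "large dog", "puppy"],
    "rain": ["light rain", "heavy rain"],
    "door": ["wooden door", "metal door"],
    "siren": ["police siren"],
}


def average_subcategories(tax: dict[str, list[str]]) -> float:
    """Return the mean number of subcategories per class tag."""
    if not tax:
        raise ValueError("taxonomy must contain at least one class")
    return sum(len(subs) for subs in tax.values()) / len(tax)


avg = average_subcategories(taxonomy)
print(f"{len(taxonomy)} classes, {avg:.2f} subcategories per class on average")
```

A plain mapping from class tag to subcategory list is the simplest assumed representation; a real pipeline would presumably also track the text, audio, and image samples attached to each subcategory.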

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing