Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions

Yi Yuan; Dongya Jia; Xiaobin Zhuang; Yuanzhe Chen; Zhengxi Liu; Zhuo Chen; Yuping Wang; Yuxuan Wang; Xubo Liu; Xiyuan Kang; Mark D. Plumbley; Wenwu Wang

arXiv:2407.04416 · cs.SD · January 3, 2025

Open Access

TL;DR

This paper introduces Sound-VECaps, a large-scale dataset with detailed audio captions generated via an automated pipeline, significantly enhancing audio generation models' ability to handle complex prompts and advancing audio-text understanding.

Contribution

The creation of Sound-VECaps, a dataset of 1.66 million high-quality audio-caption pairs with enriched details, and a demonstration of its effectiveness in improving text-to-audio generation models.

Findings

01. Training with Sound-VECaps improves performance on complex prompts.
02. Ablation studies show enhanced audio-text representation learning.
03. Dataset and models are publicly available online.

Abstract

Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models. We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details, including the order of audio events, the places where they occur, and environment information. We then demonstrate that training text-to-audio generation models with Sound-VECaps…
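
The captioning step described in the abstract combines three annotation sources per clip (a predicted visual caption, an audio caption, and tagging labels) into one LLM request. The following is a minimal sketch of how such a request might be assembled, not the authors' implementation: the `ClipAnnotations` type, the `build_caption_prompt` function, the instruction wording, and the example clip are all hypothetical, and no particular LLM API is called.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipAnnotations:
    """Per-clip inputs named in the abstract; field names are hypothetical."""
    visual_caption: str
    audio_caption: str
    tags: List[str]


def build_caption_prompt(clip: ClipAnnotations) -> str:
    """Assemble one LLM prompt from the three annotation sources.

    The instruction wording is an illustrative assumption, not the paper's prompt.
    """
    tag_text = ", ".join(clip.tags) if clip.tags else "(none)"
    return (
        "Write one detailed caption describing the sound of this clip. "
        "Mention the order of audio events and the place or environment "
        "when the inputs support it.\n"
        f"Visual caption: {clip.visual_caption}\n"
        f"Audio caption: {clip.audio_caption}\n"
        f"Tagging labels: {tag_text}"
    )


# Made-up example clip, used only to show the shape of the prompt.
example = ClipAnnotations(
    visual_caption="A man stands beside a train at a station platform.",
    audio_caption="A train horn blows and a man speaks.",
    tags=["Train horn", "Speech"],
)
prompt = build_caption_prompt(example)
print(prompt)
```

The resulting string would then be sent to whichever LLM is in use; keeping prompt assembly separate from the model call makes the per-clip inputs easy to inspect and test.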

Taxonomy

Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Music and Audio Processing