Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang

TL;DR
This paper introduces Sound-VECaps, a large-scale dataset with detailed audio captions generated via an automated pipeline, significantly enhancing audio generation models' ability to handle complex prompts and advancing audio-text understanding.
Contribution
The creation of Sound-VECaps, a dataset of 1.66 million high-quality audio-caption pairs with enriched details, and a demonstration of its effectiveness in improving text-to-audio generation models.
Findings
Training with Sound-VECaps improves performance on complex prompts.
Ablation studies show enhanced audio-text representation learning.
Dataset and models are publicly available online.
Abstract
Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models. We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details, including the order of audio events, the places where they occur, and environment information. We then demonstrate that training the text-to-audio generation models with Sound-VECaps…
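The captioning step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `build_caption_prompt` and `enrich_caption`, the prompt wording, and the `llm` callable are all assumptions made for the example, and any real LLM client would need to be supplied by the caller.

```python
from typing import Callable, Sequence


def build_caption_prompt(
    visual_caption: str, audio_caption: str, tags: Sequence[str]
) -> str:
    """Merge three per-clip predictions into one LLM instruction.

    The prompt wording is hypothetical; the paper's actual prompt is not
    given in this summary.
    """
    tag_text = ", ".join(tags) if tags else "none"
    return (
        "Write one detailed caption for an audio clip.\n"
        f"Visual caption: {visual_caption}\n"
        f"Audio caption: {audio_caption}\n"
        f"Tagging labels: {tag_text}\n"
        "Describe the order of audio events, the place, and the environment."
    )


def enrich_caption(
    visual_caption: str,
    audio_caption: str,
    tags: Sequence[str],
    llm: Callable[[str], str],
) -> str:
    """Send the merged prompt to a caller-supplied LLM and tidy the reply."""
    prompt = build_caption_prompt(visual_caption, audio_caption, tags)
    return llm(prompt).strip()


if __name__ == "__main__":
    # Stub standing in for a real model call (assumption for the demo).
    def fake_llm(prompt: str) -> str:
        return " A dog barks twice in a park, then birds sing nearby. "

    print(enrich_caption(
        "a dog runs across a grassy park",
        "a dog barks and birds chirp",
        ["Dog", "Bark", "Bird"],
        fake_llm,
    ))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM provider; only the prompt-assembly logic is fixed here.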
Taxonomy
Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Music and Audio Processing
