AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models
Jisheng Bai, Haohe Liu, Mou Wang, Dongyuan Shi, Wenwu Wang, Mark D. Plumbley, Woon-Seng Gan, Jianfeng Chen

TL;DR
This paper introduces AudioSetCaps, a large-scale audio-caption dataset created using an automated pipeline that combines audio-language models, large language models, and contrastive learning to generate high-quality, fine-grained audio-text pairs for improved audio-language tasks.
Contribution
The paper presents a novel automated pipeline for generating large-scale, high-quality audio-caption datasets leveraging advanced models, significantly expanding available resources for audio-language research.
Findings
AudioSetCaps contains 1.9 million audio-caption pairs.
Models trained on AudioSetCaps achieve state-of-the-art retrieval scores.
The pipeline and datasets are publicly available for research use.
Abstract
With the emergence of audio-language models, constructing large-scale paired audio-language datasets has become essential yet challenging for model development, primarily due to the time-intensive and labour-heavy demands involved. While large language models (LLMs) have improved the efficiency of synthetic audio caption generation, current approaches struggle to effectively extract and incorporate detailed audio information. In this paper, we propose an automated pipeline that integrates audio-language models for fine-grained content extraction, LLMs for synthetic caption generation, and a contrastive language-audio pretraining (CLAP) model-based refinement process to improve the quality of captions. Specifically, we employ prompt chaining techniques in the content extraction stage to obtain accurate and fine-grained audio information, while we use the refinement process to mitigate…
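The CLAP-based refinement step mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as a simplification not confirmed by the text above, that refinement means scoring each candidate caption against the audio clip by cosine similarity in a shared embedding space and rejecting captions that fall below a cutoff. The function names, the toy embeddings, and the 0.5 threshold are all hypothetical; in practice the embeddings would come from a CLAP model's audio and text encoders.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def select_caption(
    audio_emb: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Optional[str]:
    """Return the candidate caption most similar to the audio embedding.

    Returns None when no candidate reaches the threshold, which a
    pipeline could treat as a signal to regenerate the caption.
    """
    best_caption: Optional[str] = None
    best_score = threshold
    for caption, text_emb in candidates:
        score = cosine_similarity(audio_emb, text_emb)
        if score >= best_score:
            best_caption, best_score = caption, score
    return best_caption


# Toy 2-D embeddings standing in for real CLAP encoder outputs.
audio = [1.0, 0.0]
captions = [
    ("A dog barks repeatedly.", [0.9, 0.1]),
    ("A piano plays softly.", [0.0, 1.0]),
]
chosen = select_caption(audio, captions)
```

Returning None rather than the least-bad caption keeps the rejection decision explicit, so the caller decides whether to retry generation or drop the clip.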
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
