AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models

Jisheng Bai; Haohe Liu; Mou Wang; Dongyuan Shi; Wenwu Wang; Mark D. Plumbley; Woon-Seng Gan; Jianfeng Chen

arXiv:2411.18953 · eess.AS · December 2, 2024

TL;DR

This paper introduces AudioSetCaps, a large-scale audio-caption dataset created using an automated pipeline that combines audio-language models, large language models, and contrastive learning to generate high-quality, fine-grained audio-text pairs for improved audio-language tasks.

Contribution

The paper presents a novel automated pipeline for generating large-scale, high-quality audio-caption datasets leveraging advanced models, significantly expanding available resources for audio-language research.

Findings

01. AudioSetCaps contains 1.9 million audio-caption pairs.

02. Models trained on AudioSetCaps achieve state-of-the-art retrieval scores.

03. The pipeline and datasets are publicly available for research use.

Abstract

With the emergence of audio-language models, constructing large-scale paired audio-language datasets has become essential yet challenging for model development, primarily due to the time-intensive and labour-heavy demands involved. While large language models (LLMs) have improved the efficiency of synthetic audio caption generation, current approaches struggle to effectively extract and incorporate detailed audio information. In this paper, we propose an automated pipeline that integrates audio-language models for fine-grained content extraction, LLMs for synthetic caption generation, and a contrastive language-audio pretraining (CLAP) model-based refinement process to improve the quality of captions. Specifically, we employ prompt chaining techniques in the content extraction stage to obtain accurate and fine-grained audio information, while we use the refinement process to mitigate…
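The three stages named in the abstract (prompt-chained content extraction, LLM caption writing, and CLAP-based refinement) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name and parameter here (`chain_prompts`, `caption_clip`, `ask_audio_model`, `write_caption`, `embed_audio`, `embed_text`, `threshold`, `max_attempts`) is hypothetical, the models are passed in as plain callables, and the accept-or-regenerate rule based on cosine similarity is an assumed form of the refinement step, whose details the abstract does not give.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chain_prompts(
    audio: object,
    prompts: Sequence[str],
    ask_audio_model: Callable[[object, str], str],
) -> List[str]:
    """Prompt chaining: each question is asked with the earlier answers as context."""
    answers: List[str] = []
    for prompt in prompts:
        context = " ".join(answers)
        full_prompt = f"{context}\n{prompt}" if context else prompt
        answers.append(ask_audio_model(audio, full_prompt))
    return answers


@dataclass
class CaptionResult:
    caption: str
    similarity: float


def caption_clip(
    audio: object,
    prompts: Sequence[str],
    ask_audio_model: Callable[[object, str], str],   # hypothetical audio-language model
    write_caption: Callable[[List[str]], str],       # hypothetical LLM caption writer
    embed_audio: Callable[[object], Vector],         # hypothetical CLAP audio encoder
    embed_text: Callable[[str], Vector],             # hypothetical CLAP text encoder
    threshold: float = 0.5,                          # assumed value, not from the paper
    max_attempts: int = 3,                           # assumed value, not from the paper
) -> Optional[CaptionResult]:
    """Extract content, write a caption, and keep it only if it matches the audio."""
    facts = chain_prompts(audio, prompts, ask_audio_model)
    audio_vec = embed_audio(audio)
    for _ in range(max_attempts):
        caption = write_caption(facts)
        score = cosine_similarity(audio_vec, embed_text(caption))
        if score >= threshold:
            return CaptionResult(caption, score)
    return None  # no caption passed the similarity check
```

Passing the models in as callables keeps the sketch independent of any particular audio-language model, LLM, or CLAP checkpoint; a real pipeline would substitute actual model calls for each one.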

Code & Models

Repositories

jishengbai/audiosetcaps (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing