Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning

Luoyi Sun; Xuenan Xu; Mengyue Wu; Weidi Xie

arXiv:2309.11500·cs.SD·September 10, 2024

Open Access · 1 dataset

TL;DR

Auto-ACD is a large-scale, high-quality audio-language dataset created using an innovative automatic approach that leverages multimodal inputs and large language models, significantly advancing audio representation learning and downstream task performance.

Contribution

The paper introduces Auto-ACD, a novel large-scale audio-language dataset generated automatically using multimodal data and LLMs, addressing limitations of existing datasets in size and quality.

Findings

01 Models trained on Auto-ACD show improved performance on downstream tasks.

02 Auto-ACD enables effective zero-shot classification and audio captioning.

03 The dataset facilitates the establishment of a new benchmark for audio-text tasks.
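
Zero-shot classification with an audio-language model is commonly done by embedding the audio clip and a text prompt for each candidate class, then picking the class whose text embedding is closest to the audio embedding. The sketch below illustrates only that comparison step. It is not taken from the paper: the function names are hypothetical, and the vectors are toy values standing in for the outputs of whatever audio and text encoders are used.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(audio_emb, label_embs):
    """Return the best-matching label and the score for every label.

    audio_emb:  embedding of one audio clip (assumed precomputed).
    label_embs: mapping from class prompt to its text embedding
                (assumed precomputed).
    """
    scores = {label: cosine(audio_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings, chosen by hand for illustration only.
label_embs = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain falling": [0.0, 0.8, 0.6],
}
audio_emb = [0.8, 0.2, 0.1]

best, scores = zero_shot_classify(audio_emb, label_embs)
print(best)
```

No class-specific training is involved here: adding a new class only requires embedding one more text prompt.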

Abstract

Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from limitations in the following aspects: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames and audio streams. Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs to determine audio-visual synchronisation and to generate image captions, object detections, or audio tags for specific videos. Subsequently, we employ an LLM to paraphrase a congruent caption for each audio, guided by…
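
The abstract describes collecting several kinds of clues per video (image captions, detected objects, audio tags) and passing them to an LLM that writes one caption. A minimal sketch of the clue-gathering step might look like the following. The prompt wording, the clue names, and the `build_caption_prompt` helper are all assumptions made for illustration; the paper's actual prompt and tool outputs are not shown on this page, and no LLM is called here.

```python
def build_caption_prompt(clues):
    """Assemble a text prompt from per-video clues.

    clues: mapping from a clue name (hypothetical, e.g. "audio tags")
           to a list of strings produced by some upstream model.
    Empty clue lists are skipped so the prompt only mentions
    information that is actually available.
    """
    lines = ["Write one sentence describing the audio, using the clues below."]
    for name, values in clues.items():
        if values:
            lines.append(f"{name}: {', '.join(values)}")
    return "\n".join(lines)


# Made-up clues for a single clip, for illustration only.
clues = {
    "image caption": ["a street at night in heavy rain"],
    "detected objects": [],
    "audio tags": ["rain", "thunder"],
}

prompt = build_caption_prompt(clues)
print(prompt)
```

Keeping each clue on its own labelled line makes it easy to drop or add a source without changing the rest of the prompt.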

Code & Models

Datasets

Loie/Auto-ACD
dataset · 157 dl

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing