Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie

TL;DR
Auto-ACD is a large-scale, high-quality audio-language dataset created using an innovative automatic approach that leverages multimodal inputs and large language models, significantly advancing audio representation learning and downstream task performance.
Contribution
The paper introduces Auto-ACD, a novel large-scale audio-language dataset generated automatically using multimodal data and LLMs, addressing limitations of existing datasets in size and quality.
Findings
Models trained on Auto-ACD show improved performance on downstream tasks.
Auto-ACD enables effective zero-shot classification and audio captioning.
The dataset facilitates the establishment of a new benchmark for audio-text tasks.
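The zero-shot classification finding can be illustrated with a minimal sketch. It assumes an audio-language model has already mapped an audio clip and each candidate label text into a shared embedding space; the 3-dimensional vectors and the `zero_shot_classify` helper below are made up for illustration and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(audio_emb, label_embs):
    """Return the label whose text embedding is closest to the audio embedding."""
    return max(label_embs, key=lambda label: cosine(audio_emb, label_embs[label]))

# Toy embeddings (assumption): a real model would produce these vectors.
label_embs = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain falling": [0.0, 0.8, 0.6],
    "car engine": [0.1, 0.2, 0.9],
}
audio_emb = [0.1, 0.7, 0.7]

prediction = zero_shot_classify(audio_emb, label_embs)
print(prediction)
```

No classifier is trained on the target labels here: the label set is supplied only at inference time as text, which is what makes the setup zero-shot.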
Abstract
Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from three limitations: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames and audio streams. Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models and APIs to determine audio-visual synchronisation and to generate image captions, object detections, and audio tags for specific videos. Subsequently, we employ an LLM to paraphrase a congruent caption for each audio clip, guided by…
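The caption-generation step described in the abstract can be pictured with a minimal sketch: clues produced by separate pre-trained models are gathered into a single text prompt for an LLM. The function name `build_caption_prompt`, the clue field names, the example values, and the prompt wording are all hypothetical; the authors' actual prompt and tooling are not shown in this summary.

```python
def build_caption_prompt(clues):
    """Assemble an LLM prompt from per-clip clues, skipping any that are missing."""
    lines = ["Write one sentence describing the sound in this clip, using these clues:"]
    for name, value in clues.items():
        if value:  # skip empty strings and empty lists
            text = ", ".join(value) if isinstance(value, list) else value
            lines.append(f"- {name}: {text}")
    return "\n".join(lines)

# Hypothetical clues for one video clip (illustrative values only).
clues = {
    "image caption": "a train passing a station platform",
    "detected objects": [],  # the detector returned nothing for this clip
    "audio tags": ["rail transport", "train horn"],
}

prompt = build_caption_prompt(clues)
print(prompt)
```

The resulting string would then be sent to whichever LLM is used; that call is left out because it depends on the provider.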
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
