ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding

Yadong Niu; Tianzi Wang; Heinrich Dinkel; Xingwei Sun; Jiahao Zhou; Gang Li; Jizhong Liu; Junbo Zhang; Jian Luan

arXiv:2603.24038·eess.AS·March 26, 2026

TL;DR

ACAVCaps is a large-scale, detailed audio captioning dataset that enhances the training of versatile audio-language models, leading to improved generalization across multiple audio understanding tasks.

Contribution

We introduce ACAVCaps, a novel large-scale, fine-grained audio captioning dataset created with a multi-expert pipeline, enabling better training of general audio understanding models.

Findings

01. Models trained on ACAVCaps show stronger generalization on downstream tasks.

02. ACAVCaps surpasses existing datasets in scale and descriptive detail.

03. The dataset is publicly available for research use.

Abstract

General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity required to train truly versatile models. To address this gap, we introduce ACAVCaps, a new large-scale, fine-grained, and multi-faceted audio captioning dataset. Derived from the ACAV100M collection, ACAVCaps is constructed using a multi-expert pipeline that analyzes audio from diverse perspectives (including speech, music, and acoustic properties), which are then synthesized into rich, detailed descriptions by a large language model. Experimental results demonstrate that models pre-trained on ACAVCaps exhibit substantially stronger generalization capabilities on various downstream tasks compared to those trained on…
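
As a rough illustration of the multi-expert idea described in the abstract, the sketch below runs several per-perspective analyzers on a clip and hands their findings to a synthesis step. It is not the authors' pipeline: the expert functions, their canned outputs, the prompt wording, the clip identifier, and `toy_synthesizer` (a stand-in for the large language model) are all hypothetical placeholders chosen for this example.

```python
from typing import Callable, Dict

# Hypothetical type: an "expert" maps a clip identifier to a short text finding.
Expert = Callable[[str], str]


def speech_expert(clip_id: str) -> str:
    # Placeholder output; a real expert would analyze the speech content.
    return "a speaker talks calmly"


def music_expert(clip_id: str) -> str:
    # Placeholder output; a real expert would analyze the musical content.
    return "soft piano plays in the background"


def acoustic_expert(clip_id: str) -> str:
    # Placeholder output; a real expert would analyze acoustic properties.
    return "recorded indoors with light reverberation"


# Assumed set of perspectives, mirroring the three named in the abstract.
EXPERTS: Dict[str, Expert] = {
    "speech": speech_expert,
    "music": music_expert,
    "acoustics": acoustic_expert,
}


def build_prompt(findings: Dict[str, str]) -> str:
    """Format the per-expert findings as a prompt for the synthesis step."""
    lines = [f"- {name}: {text}" for name, text in findings.items()]
    return "Write one detailed audio caption from these analyses:\n" + "\n".join(lines)


def toy_synthesizer(prompt: str) -> str:
    """Stand-in for an LLM call: joins the findings into a single sentence."""
    parts = [
        line.split(": ", 1)[1]
        for line in prompt.splitlines()
        if line.startswith("- ")
    ]
    return "; ".join(parts).capitalize() + "."


def caption_clip(
    clip_id: str,
    experts: Dict[str, Expert],
    synthesize: Callable[[str], str],
) -> str:
    """Run every expert on the clip, then synthesize one caption."""
    findings = {name: expert(clip_id) for name, expert in experts.items()}
    return synthesize(build_prompt(findings))


if __name__ == "__main__":
    print(caption_clip("clip_0001", EXPERTS, toy_synthesizer))
```

Passing the synthesizer in as a callable keeps the expert stage independent of whichever language model performs the final rewrite, so either side can be swapped without touching the other.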

Code & Models

Models

Hugging Face: mispeech/midashenglm-0.6b-fp32 (model, 16 downloads, 1 like)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing