Audio-FLAN: A Preliminary Release

Liumeng Xue; Ziya Zhou; Jiahao Pan; Zixuan Li; Shuai Fan; Yinghao Ma; Sitong Cheng; Dongchao Yang; Haohan Guo; Yujia Xiao; Xinsheng Wang; Zixuan Shen; Chuanbo Zhu; Xinshen Zhang; Tianchi Liu; Ruibin Yuan; Zeyue Tian; Haohe Liu; Emmanouil Benetos; Ge Zhang; Yike Guo; Wei Xue

arXiv:2502.16584 · cs.SD · February 25, 2025

TL;DR

Audio-FLAN introduces a large-scale, diverse instruction-tuning dataset for audio tasks, enabling the development of unified models capable of understanding and generating speech, music, and sounds in a zero-shot setting.

Contribution

Audio-FLAN provides the first comprehensive dataset covering both audio understanding and generation tasks, facilitating unified audio-language modeling.

Findings

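01 The dataset includes over 100 million instances across 80 tasks in the speech, music, and sound domains.

02 It is intended to enable zero-shot generalization across diverse audio tasks through instruction tuning.

03 It supports the development of unified audio-language models that handle both understanding and generation.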

Abstract

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription,…

Code & Models

Repositories

lmxue/audio-flan
Official

Datasets

HKUSTAudio/Audio-FLAN-Dataset
dataset · 132k downloads

Taxonomy

Topics: Speech Recognition and Synthesis