Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu; Jin Xu; Xiaohuan Zhou; Qian Yang; Shiliang Zhang; Zhijie Yan; Chang Zhou; Jingren Zhou

arXiv:2311.07919 · eess.AS · December 22, 2023 · 21 cites

Open Access 2 Repos 10 Models

TL;DR

Qwen-Audio is a large-scale, unified audio-language model that supports over 30 diverse audio tasks and types, enabling universal audio understanding and multi-turn dialogue capabilities without task-specific fine-tuning.

Contribution

The paper introduces Qwen-Audio, an audio-language model trained with a multi-task framework that uses hierarchical tags to scale audio-language pre-training across diverse tasks and audio types.
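
The idea of hierarchical tags can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the tag names, their order, and the helper names (TaskSpec, build_tag_prefix, build_decoder_target) are hypothetical and are not taken from the paper or its code. The sketch only shows how a sequence of tags, ordered from general to specific, could be prepended to each dataset's text label so that related tasks share the leading tags while the trailing tags keep their differing label formats apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one training example's task."""
    audio_lang: str   # language heard in the audio, e.g. "en" or "unknown"
    task: str         # e.g. "transcribe" or "caption" (illustrative labels)
    text_lang: str    # language of the target text
    timestamps: bool  # whether the label contains timestamps


def build_tag_prefix(spec: TaskSpec) -> list[str]:
    """Return the tag sequence, ordered from general to specific.

    The tag spellings are assumptions for this sketch only.
    """
    return [
        f"<|{spec.audio_lang}|>",
        f"<|{spec.task}|>",
        f"<|{spec.text_lang}|>",
        "<|timestamps|>" if spec.timestamps else "<|notimestamps|>",
    ]


def build_decoder_target(spec: TaskSpec, label_text: str) -> str:
    """Prepend the tag prefix to the dataset's text label."""
    return "".join(build_tag_prefix(spec)) + label_text


# Two tasks on English audio share the first tag and differ in the task tag.
asr = TaskSpec(audio_lang="en", task="transcribe", text_lang="en", timestamps=False)
cap = TaskSpec(audio_lang="en", task="caption", text_lang="en", timestamps=False)
asr_target = build_decoder_target(asr, "hello world")
cap_target = build_decoder_target(cap, "a person greets someone")
```

In this sketch the shared leading tag is what would let two tasks benefit from common training signal, and the differing task tag is what would tell the decoder which label format to produce.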

Findings

01 Outperforms existing models on multiple benchmark tasks.

02 Supports diverse audio types including speech, sounds, music, and songs.

03 Enables multi-turn audio and text dialogues with Qwen-Audio-Chat.

Abstract

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing