Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou

TL;DR
Qwen-Audio is a large-scale, unified audio-language model that supports over 30 diverse audio tasks and types, enabling universal audio understanding and multi-turn dialogue capabilities without task-specific fine-tuning.
Contribution
The paper introduces Qwen-Audio, trained with a multi-task framework that conditions the decoder on a sequence of hierarchical tags, which lets audio-language pre-training scale across diverse tasks and audio types.
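As a rough illustration of what conditioning on hierarchical tags can look like, the sketch below builds a tag prefix ordered from coarse (audio language) to fine (timestamp mode). The class `TaskSpec`, the function `build_tag_prefix`, and the tag strings themselves are hypothetical placeholders chosen for this example; they are not taken from this page and should not be read as the exact format used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one training example's task."""
    audio_lang: str           # language of the audio, e.g. "en"
    task: str                 # task name, e.g. "transcribe" or "caption"
    text_lang: str            # language of the target text
    timestamps: bool = False  # whether the target includes timestamps


def build_tag_prefix(spec: TaskSpec) -> str:
    """Return a coarse-to-fine tag prefix to prepend to the decoder input.

    Tags shared between tasks (such as the audio language) give related
    tasks a common prefix, while later tags keep their outputs distinct.
    """
    tags = [
        f"<|{spec.audio_lang}|>",
        f"<|{spec.task}|>",
        f"<|{spec.text_lang}|>",
        "<|timestamps|>" if spec.timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)


# Two tasks on English audio share the leading tag and diverge afterwards.
asr_prefix = build_tag_prefix(TaskSpec("en", "transcribe", "en"))
caption_prefix = build_tag_prefix(TaskSpec("en", "caption", "en"))
```

The design intent in a scheme like this is that shared tags encourage knowledge sharing across datasets, while task-specific tags reduce interference between differently annotated targets.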
Findings
Outperforms comparable models on diverse benchmark tasks without task-specific fine-tuning.
Supports diverse audio types including speech, sounds, music, and songs.
Enables multi-turn audio and text dialogues with Qwen-Audio-Chat.
Abstract
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text…
Code & Models
- 🤗 Qwen/Qwen-Audio (model) · 1.8k downloads · ♡ 147
- 🤗 Qwen/Qwen-Audio-Chat (model) · 2.5k downloads · ♡ 95
- 🤗 4bit/Qwen-Audio-Chat (model) · 10 downloads
- 🤗 xun/Qwen-Audio-Chat-Int4 (model) · 6 downloads · ♡ 4
- 🤗 Ostixe360/Qwen-Audio-nf4 (model) · 6 downloads · ♡ 1
- 🤗 Qwen/Qwen2-Audio-7B (model) · 5.6k downloads · ♡ 165
- 🤗 Qwen/Qwen2-Audio-7B-Instruct (model) · 360k downloads · ♡ 526
- 🤗 thucdangvan020999/qwen-audio-new-task (model) · 3 downloads
- 🤗 Sergei6000/Qwen2-Audio-7B-Instruct-Int4 (model) · 142 downloads · ♡ 8
- 🤗 thucdangvan020999/qwen-audio-chat_newtask (model) · 2 downloads
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
