AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

TL;DR
AudioGPT is a multi-modal AI system that enhances large language models with audio processing capabilities, enabling complex understanding and generation of speech, music, and sounds in multi-turn dialogues, advancing human-AI audio interactions.
Contribution
The paper introduces AudioGPT, a novel multi-modal system that integrates foundation models and speech interfaces with LLMs for comprehensive audio understanding and generation.
Findings
Demonstrates effective multi-turn dialogue capabilities with audio content
Shows robustness and consistency in audio understanding tasks
Enables creation of diverse audio content with ease
Abstract
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the…
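The two-part design described in the abstract (foundation models for audio tasks, plus an ASR/TTS interface around the LLM) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not taken from the AudioGPT codebase: `transcribe` and `synthesize` stand in for real ASR and TTS models, `TASK_MODELS` stands in for the audio foundation models, and the keyword-based `route` function stands in for the LLM's analysis of the user's intent.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Turn:
    """One user turn; `text` holds either typed text or a mock audio payload."""
    text: str
    is_speech: bool = False


def transcribe(audio: str) -> str:
    """Hypothetical ASR stand-in: normalizes the mock audio payload to text."""
    return audio.strip().lower()


def synthesize(text: str) -> str:
    """Hypothetical TTS stand-in: wraps the reply to mark it as spoken output."""
    return f"<speech>{text}</speech>"


# Hypothetical registry of audio foundation models, keyed by task family.
TASK_MODELS: Dict[str, Callable[[str], str]] = {
    "generate": lambda req: f"[generated audio for: {req}]",
    "understand": lambda req: f"[description of audio in: {req}]",
}


def route(request: str) -> str:
    """Keyword heuristic standing in for the LLM's task analysis (an assumption)."""
    lowered = request.lower()
    generation_cues = ("generate", "synthesize", "create", "sing")
    if any(cue in lowered for cue in generation_cues):
        return "generate"
    return "understand"


def respond(turn: Turn) -> str:
    """Speech in -> text -> task model -> text -> speech out (if input was speech)."""
    request = transcribe(turn.text) if turn.is_speech else turn.text
    result = TASK_MODELS[route(request)](request)
    return synthesize(result) if turn.is_speech else result


if __name__ == "__main__":
    print(respond(Turn("Generate a dog barking sound")))
    print(respond(Turn("  What is in this clip?  ", is_speech=True)))
```

In a real system the routing step would be performed by the LLM itself and the registry entries would call actual speech, music, and sound models; the sketch only shows how a spoken-dialogue interface can wrap a text-based dispatcher.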
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling
Methods: Test
