AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Rongjie Huang; Mingze Li; Dongchao Yang; Jiatong Shi; Xuankai Chang; Zhenhui Ye; Yuning Wu; Zhiqing Hong; Jiawei Huang; Jinglin Liu; Yi Ren; Zhou Zhao; Shinji Watanabe

arXiv:2304.12995·cs.CL·April 26, 2023·21 cites

TL;DR

AudioGPT is a multi-modal AI system that enhances large language models with audio processing capabilities, enabling complex understanding and generation of speech, music, and sounds in multi-turn dialogues, advancing human-AI audio interactions.

Contribution

The paper introduces AudioGPT, a novel multi-modal system that integrates foundation models and speech interfaces with LLMs for comprehensive audio understanding and generation.

Findings

01 Demonstrates effective multi-turn dialogue capabilities with audio content
02 Shows robustness and consistency in audio understanding tasks
03 Enables creation of diverse audio content with ease

Abstract

Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the…
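The two components named in the abstract (foundation models for audio tasks, plus an ASR/TTS interface around the LLM) can be pictured with a minimal sketch. Everything below is an assumption for illustration: the function names, the keyword-based routing, and the task labels are hypothetical stand-ins, not AudioGPT's actual code or API.

```python
from typing import Callable, Dict

# Hypothetical stand-ins only: none of these names come from the AudioGPT repository.

def fake_asr(audio: bytes) -> str:
    """Stand-in for speech recognition: treats the bytes as a UTF-8 transcript."""
    return audio.decode("utf-8")


def fake_tts(text: str) -> bytes:
    """Stand-in for speech synthesis: encodes the reply text as bytes."""
    return text.encode("utf-8")


def fake_llm_route(request: str) -> str:
    """Stand-in for the LLM choosing a task; a real system would prompt an LLM."""
    lowered = request.lower()
    if "transcribe" in lowered:
        return "speech_recognition"
    if "music" in lowered:
        return "music_generation"
    return "chat"


# Stand-in registry of "foundation models", keyed by the task label chosen above.
FOUNDATION_MODELS: Dict[str, Callable[[str], str]] = {
    "speech_recognition": lambda req: f"[transcript for: {req}]",
    "music_generation": lambda req: f"[generated music for: {req}]",
    "chat": lambda req: f"[text reply to: {req}]",
}


def spoken_turn(audio_in: bytes) -> bytes:
    """One dialogue turn: speech in, task dispatch, speech out."""
    request = fake_asr(audio_in)               # input interface (ASR)
    task = fake_llm_route(request)             # LLM decides which model to call
    result = FOUNDATION_MODELS[task](request)  # selected model handles the task
    return fake_tts(result)                    # output interface (TTS)
```

The registry keeps the dispatch step separate from the speech interface, so either side could be swapped out without touching the other.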

Code & Models

Repositories

aigc-audio/audiogpt (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling

Methods: Test