AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Jun Zhan; Junqi Dai; Jiasheng Ye; Yunhua Zhou; Dong Zhang; Zhigeng Liu; Xin Zhang; Ruibin Yuan; Ge Zhang; Linyang Li; Hang Yan; Jie Fu; Tao Gui; Tianxiang Sun; Yu-Gang Jiang; Xipeng Qiu

arXiv:2402.12226 · cs.CL · September 9, 2025 · 3 cites

TL;DR

AnyGPT is an any-to-any multimodal language model that uses discrete representations to process speech, text, images, and music in a unified way, enabling flexible multimodal interactions without altering the existing LLM architecture.

Contribution

The paper introduces a unified discrete-representation approach to multimodal processing, so that new modalities can be integrated through data-level preprocessing alone, without changing the LLM architecture.

Findings

01 Achieves comparable performance to specialized models across modalities

02 Supports any-to-any multimodal conversations

03 Facilitates seamless addition of new modalities

Abstract

We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental…
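The abstract's central idea, that every modality becomes discrete tokens in one sequence and a new modality is added much like a new language, can be illustrated with a minimal sketch. This is not the authors' implementation: the `UnifiedVocab` class, the boundary-token layout, and all vocabulary and codebook sizes are hypothetical, and the per-modality tokenizers that would produce the codebook indices are assumed to exist elsewhere.

```python
class UnifiedVocab:
    """Hypothetical sketch: one shared token-id space for text plus
    discrete codes from other modalities. Not AnyGPT's actual code."""

    def __init__(self, text_vocab_size):
        # Text tokens occupy ids [0, text_vocab_size).
        self.size = text_vocab_size
        self.ranges = {}  # modality name -> (start_id, codebook_size)

    def add_modality(self, name, codebook_size):
        # Reserve two boundary tokens (start, end) followed by one id per
        # codebook entry. Only the vocabulary grows; the model architecture
        # is untouched, which is the "data-level preprocessing" idea.
        start = self.size
        self.ranges[name] = (start, codebook_size)
        self.size += codebook_size + 2
        return start

    def encode(self, name, codes):
        # Map modality-local codebook indices to shared token ids and
        # wrap them in that modality's boundary tokens.
        start, codebook_size = self.ranges[name]
        if any(c < 0 or c >= codebook_size for c in codes):
            raise ValueError(f"code out of range for modality {name!r}")
        begin_id, end_id = start, start + 1
        return [begin_id] + [start + 2 + c for c in codes] + [end_id]


# Illustrative sizes only (assumptions, not values from the paper).
vocab = UnifiedVocab(text_vocab_size=32000)
vocab.add_modality("image", codebook_size=8192)
vocab.add_modality("speech", codebook_size=1024)

# A mixed sequence: placeholder text ids followed by image codes.
sequence = [101, 102] + vocab.encode("image", [0, 5])
```

Under this layout, a standard next-token objective over `sequence` would treat image or speech codes no differently from text tokens, so supporting another modality amounts to one more `add_modality` call plus a tokenizer for that modality.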

Code & Models

Repositories

OpenMOSS/AnyGPT
pytorch

Datasets

OpenMOSS-Team/AnyInstruct
dataset · 155 dl

Videos

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis