NExT-GPT: Any-to-Any Multimodal LLM

Shengqiong Wu; Hao Fei; Leigang Qu; Wei Ji; Tat-Seng Chua

arXiv:2309.05519 · cs.AI · June 26, 2024 · 95 cites

TL;DR

NExT-GPT is a versatile multimodal language model capable of understanding and generating content across text, images, videos, and audio, advancing towards human-like AI communication.

Contribution

The paper introduces an end-to-end any-to-any multimodal LLM system, NExT-GPT, with minimal parameter tuning and a new modality-switching instruction tuning method.

Findings

01 Enables arbitrary modality input and output combinations.

02 Achieves high cross-modal semantic understanding.

03 Facilitates low-cost training and easy modality expansion.

Abstract

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained highly-performing encoders and decoders, NExT-GPT is tuned with only a small amount of parameter (1%) of certain…
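The "small amount of parameter (1%)" claim in the abstract can be made concrete with a toy parameter budget. The sketch below is illustrative only: the component names, the parameter counts, and the idea that the tuned weights sit in small projection layers between frozen pretrained parts are all assumptions made here, since the abstract above is truncated before it says which layers are tuned.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(components):
    """Return the share of all parameters that are marked trainable."""
    total = sum(c.num_params for c in components)
    trainable = sum(c.num_params for c in components if c.trainable)
    return trainable / total


# Hypothetical sizes, chosen only to illustrate the bookkeeping.
# They are NOT the figures reported in the paper.
components = [
    Component("multimodal_encoder", 1_000_000_000, trainable=False),
    Component("input_projection", 30_000_000, trainable=True),
    Component("llm", 7_000_000_000, trainable=False),
    Component("output_projection", 60_000_000, trainable=True),
    Component("diffusion_decoders", 3_000_000_000, trainable=False),
]

frac = trainable_fraction(components)
print(f"trainable share: {frac:.2%}")
```

With these made-up sizes the trainable share comes out just under 1%, which is the order of magnitude the abstract describes: the large pretrained encoders, LLM, and decoders stay frozen while only thin connecting layers are updated.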

Peer Reviews

Decision: ICML 2024 Oral

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- Extending multimodal LLMs so that they are free of limitations on input/output modalities is an important research question that can facilitate a wider range of applications.
- The introduced dataset, if made publicly available, would be a good contribution to the community.
- Various evaluation benchmarks are used to compare the proposed model with existing solutions.
- The writing is clean and easy to follow.

Weaknesses

- The proposed alignment learning technique is a bit naive and does not consider much about the challenge introduced by the any-to-any modality, such as how to balance the performance across different modalities.
- Although introducing contents from different modalities during tuning is considered to improve the overall performance of the model, in the experiment section, it seems introducing these additional modalities actually leads to worse performance on benchmarking datasets. Does this ind

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

1. The system architecture is compact and includes multiple decoders for text, video, audio, and image generation, making it straightforward to implement.
2. The generation process is end-to-end and does not require initial text generation.

Weaknesses

1. The quality of the generated output depends primarily on the pre-trained generation modules. If these modules are flawed or produce errors, the system cannot rectify these issues. For instance, if Stable Diffusion struggles to render certain elements accurately in image generation (e.g., the number of human fingers), NExT-GPT would not be able to produce an accurate output, irrespective of its understanding of the instruction.
2. The evaluation strategy appears questionable. It se

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

This paper proposes a novel approach to enable any-to-any generation by integrating off-the-shelf diffusion models with LLMs. The proposed approach to align the semantic tokens with outputs from text encoders of diffusion models seems efficient. The results look promising.
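The alignment idea this reviewer describes, matching the LLM's semantic tokens to the output of a diffusion model's text encoder, can be sketched as a distance between two vectors. The code below is a toy illustration under stated assumptions: the linear projection, the mean-squared-error distance, and all names and numbers are hypothetical choices made here, not the paper's confirmed objective.

```python
def project(hidden, weight):
    """Apply a linear map; `weight` holds one row per output dimension."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def alignment_loss(projected, target):
    """Mean squared error between two equal-length vectors."""
    if len(projected) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(projected, target)) / len(target)


# Toy values (hypothetical): an LLM hidden state for one semantic token,
# a small trainable projection, and the caption embedding that a frozen
# text encoder would produce for the matching caption.
signal_token_hidden = [1.0, 2.0, 3.0]
projection_weight = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
]
caption_embedding = [1.0, 2.0]

projected = project(signal_token_hidden, projection_weight)
loss = alignment_loss(projected, caption_embedding)
print(projected, loss)
```

In a setup like this, only `projection_weight` would be updated to reduce the loss, while the LLM and the text encoder stay fixed. That matches the low-cost training described in the findings, though the exact loss used by the authors is not stated on this page.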

Weaknesses

The major concern I have regarding this paper is the training objective during alignment, which is to align the semantic tokens with outputs from text encoders of diffusion models. This seems reasonable at first, but if the objective is to "match the semantics token with textual captions' representations from the text encoders of diffusion models", why not just directly use the diffusion model's text encoder to encode the textual captions? More specifically, why not just let the LLM output a captio

Code & Models

Repositories

NExT-GPT/NExT-GPT
PyTorch · Official

Models

🤗 osamaifti/NEXTGPT
model · ♡ 4

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Diffusion