PandaGPT: One Model To Instruction-Follow Them All

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, Deng Cai

arXiv:2305.16355 · cs.CL · May 29, 2023 · 46 cites

TL;DR

PandaGPT is a multimodal large language model that integrates visual and auditory inputs to perform complex, cross-modal tasks, demonstrating emergent capabilities across various data types with minimal aligned training data.

Contribution

PandaGPT combines ImageBind and Vicuna models to enable multimodal instruction-following with zero-shot cross-modal capabilities using only aligned image-text pairs.

Findings

01 Generates detailed image descriptions and writes stories inspired by videos.

02 Accepts inputs from several modalities at once and composes their semantics naturally.

03 Displays emergent (zero-shot) cross-modal behaviors for modalities beyond image and text.

Abstract

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation, writing stories inspired by videos, and answering questions about audios. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image/video and how they sound in an audio. To do so, PandaGPT combines the multimodal encoders from ImageBind and the large language models from Vicuna. Notably, only aligned image-text pairs are required for the training of PandaGPT. Thanks to the strong capability of ImageBind in embedding data from different modalities into the same space, PandaGPT displays emergent, i.e. zero-shot, cross-modal behaviors for…
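The pipeline described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes that a single linear projection maps a shared-space embedding into the language model's input space, and that inputs from two modalities are composed by summing their embeddings; the abstract states neither detail. All names, dimensions, and values below are hypothetical, and the toy vectors stand in for real ImageBind outputs and Vicuna token embeddings.

```python
import random

# Hypothetical sizes; real encoders and language models use far larger dimensions.
SHARED_DIM = 4   # size of the shared multimodal embedding space (assumed)
LLM_DIM = 6      # size of the language model's token embeddings (assumed)


def make_projection(in_dim, out_dim, seed=0):
    """Create a random out_dim x in_dim matrix standing in for a trained projection."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, embedding):
    """Apply the linear projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def compose(*embeddings):
    """Combine embeddings that live in the same space by element-wise summation (assumed)."""
    return [sum(values) for values in zip(*embeddings)]


def build_llm_inputs(prefix_vector, text_token_embeddings):
    """Prepend the projected multimodal vector to the text token embeddings."""
    return [prefix_vector] + list(text_token_embeddings)


# Toy stand-ins for encoder outputs of an image and an audio clip.
image_emb = [0.2, -0.1, 0.5, 0.0]
audio_emb = [0.1, 0.3, -0.2, 0.4]

W = make_projection(SHARED_DIM, LLM_DIM)
joint = compose(image_emb, audio_emb)      # one vector in the shared space
prefix = project(W, joint)                 # mapped into the language model's space

# Toy stand-ins for the embeddings of three instruction tokens.
text_tokens = [[0.0] * LLM_DIM for _ in range(3)]
inputs = build_llm_inputs(prefix, text_tokens)

print(len(inputs), len(inputs[0]))
```

Because every modality lands in the same space, the projection in this sketch never needs to know which encoder produced its input. That is one way to read the abstract's claim that training on image-text pairs alone can carry over to other modalities.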

Code & Models

Repositories

yxuansu/pandagpt
pytorch

Models

🤗 mvsoom/pandagpt-vicuna-v0-7b
model · ♡ 3

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling