PandaGPT: One Model To Instruction-Follow Them All
Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai

TL;DR
PandaGPT is a multimodal large language model that follows instructions over visual and auditory inputs. Although it is trained only on aligned image-text pairs, it shows emergent (zero-shot) cross-modal capabilities on other data types.
Contribution
PandaGPT combines the multimodal encoders from ImageBind with the Vicuna large language model to enable multimodal instruction-following. Training requires only aligned image-text pairs, yet the model shows zero-shot cross-modal capabilities.
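The combination described above can be pictured with a small sketch. It assumes that a single linear projection maps a shared-space embedding into the language model's token-embedding space and that the result is prepended to the text prompt. The connector is not specified in this summary, so treat that as an assumption, not as the authors' implementation. All names and dimensions (`SHARED_DIM`, `LLM_DIM`, `make_projection`, `project`, `build_llm_input`) are hypothetical, and random vectors stand in for real ImageBind and Vicuna embeddings.

```python
import random

SHARED_DIM = 8   # hypothetical size of the shared multimodal embedding space
LLM_DIM = 16     # hypothetical size of the language model's token embeddings


def make_projection(in_dim, out_dim, seed=0):
    """Return a random (out_dim x in_dim) weight matrix and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(embedding, weights, bias):
    """Apply the linear map W @ x + b to one shared-space embedding."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]


def build_llm_input(modality_embedding, text_token_embeddings, weights, bias):
    """Prepend the projected modality vector to the text token embeddings."""
    prefix = project(modality_embedding, weights, bias)
    return [prefix] + text_token_embeddings


# Stand-in embeddings: in a real system these would come from the encoders.
rng = random.Random(1)
image_embedding = [rng.gauss(0, 1) for _ in range(SHARED_DIM)]
audio_embedding = [rng.gauss(0, 1) for _ in range(SHARED_DIM)]
prompt_tokens = [[0.0] * LLM_DIM for _ in range(3)]  # three placeholder tokens

# The same projection is reused for both modalities.
weights, bias = make_projection(SHARED_DIM, LLM_DIM)
image_input = build_llm_input(image_embedding, prompt_tokens, weights, bias)
audio_input = build_llm_input(audio_embedding, prompt_tokens, weights, bias)
```

Because image and audio embeddings live in the same space, the sketch reuses one projection for both. This mirrors the intuition behind training on image-text pairs only and still handling other modalities at inference time.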
Findings
Performs detailed image descriptions and video-inspired storytelling.
Handles multimodal inputs simultaneously for natural semantic composition.
Displays emergent cross-modal behaviors beyond image and text.
Abstract
We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation, writing stories inspired by videos, and answering questions about audios. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image/video and how they sound in an audio. To do so, PandaGPT combines the multimodal encoders from ImageBind and the large language models from Vicuna. Notably, only aligned image-text pairs are required for the training of PandaGPT. Thanks to the strong capability of ImageBind in embedding data from different modalities into the same space, PandaGPT displays emergent, i.e. zero-shot, cross-modal behaviors for…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
