ImageBind-LLM: Multi-modality Instruction Tuning

Jiaming Han; Renrui Zhang; Wenqi Shao; Peng Gao; Peng Xu; Han Xiao; Kaipeng Zhang; Chris Liu; Song Wen; Ziyu Guo; Xudong Lu; Shuai Ren; Yafei Wen; Xiaoxin Chen; Xiangyu Yue; Hongsheng Li; Yu Qiao

arXiv:2309.03905·cs.MM·September 13, 2023·24 cites

Open Access · 2 Repos · 1 Model

TL;DR

ImageBind-LLM introduces a multi-modality instruction tuning method for large language models, enabling them to respond to diverse modalities like audio, 3D, and video through a novel embedding alignment and visual cache system.

Contribution

The paper presents a new multi-modality instruction tuning approach for LLMs using ImageBind, allowing responses to multiple modalities with a simple training process and a training-free cache for enhanced performance.

Findings

01 The model responds to diverse input modalities, including audio, 3D point clouds, and video.

02 It achieves superior multi-modality instruction-following using only image-text alignment training.

03 It demonstrates high language generation quality across modalities.

Abstract

We present ImageBind-LLM, a multi-modality instruction tuning method for large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning. In contrast, ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic, using only image-text alignment training. During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder. The image features transformed by the bind network are then added to the word tokens of all layers in LLaMA, which progressively injects visual instructions via an attention-free and zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, this simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During…
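
The injection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bind network is reduced here to a single hypothetical linear map, the helper names `bind_project` and `gated_inject` are invented for illustration, and the vectors are toy values rather than real ImageBind or LLaMA features. The point it shows is that a gate initialized to zero leaves the word tokens untouched at the start of training, so the visual signal enters only as the gate is learned.

```python
# Illustrative sketch only; names, shapes, and values are assumptions.

def bind_project(feature, weight):
    """Hypothetical stand-in for the bind network: one linear map
    from the encoder's embedding size to the LLM hidden size."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]

def gated_inject(word_tokens, visual_vec, gate):
    """Add the gated visual vector to every word token.
    No attention is involved: it is a plain element-wise addition."""
    return [[t + gate * v for t, v in zip(token, visual_vec)]
            for token in word_tokens]

# Toy inputs (assumed): a 3-dim "image embedding" and two 2-dim word tokens.
image_feature = [0.5, -1.0, 2.0]
bind_weight = [[1.0, 0.0, 0.0],
               [0.0, 1.0, 1.0]]
word_tokens = [[0.1, 0.2], [0.3, 0.4]]

visual_vec = bind_project(image_feature, bind_weight)

# One gate per layer, all initialized to zero.
num_layers = 2
gates = [0.0] * num_layers

hidden = word_tokens
for gate in gates:
    hidden = gated_inject(hidden, visual_vec, gate)
    # A real model would apply the transformer layer here.

# With zero gates the tokens pass through unchanged.
unchanged_at_init = hidden == word_tokens

# Once a gate has moved away from zero, the visual vector shifts every token.
shifted = gated_inject(word_tokens, visual_vec, 0.5)
```

Because ImageBind places other modalities in the same embedding space as images, the same projection could in principle be fed an audio or point-cloud embedding in place of `image_feature`, which is the property the abstract relies on.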

Code & Models

Repositories

Models

🤗 mlfu7/Touch-Vision-Language-Models · model · ♡ 8

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Focus · ALIGN