ImageBind-LLM: Multi-modality Instruction Tuning
Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao

TL;DR
ImageBind-LLM introduces a multi-modality instruction tuning method for large language models, enabling them to respond to diverse modalities like audio, 3D, and video through a novel embedding alignment and visual cache system.
Contribution
The paper presents a new multi-modality instruction tuning approach for LLMs using ImageBind, allowing responses to multiple modalities with a simple training process and a training-free cache for enhanced performance.
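The summary above does not spell out how the training-free cache works. As a rough illustration only, one plausible reading is nearest-neighbour retrieval: at inference time, a non-image embedding is pulled toward similar image embeddings stored in a cache, so the LLM sees something closer to what it was trained on. The function name `cache_enhance`, the softmax weighting, and the `k` and `blend` parameters below are all hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cache_enhance(query, cache, k=2, blend=0.5):
    """Blend a query embedding with its top-k most similar cached embeddings.

    Assumption: `cache` is a list of image embeddings collected ahead of time.
    No parameters are learned here, which is what "training-free" is taken to mean.
    """
    # Rank cached embeddings by similarity to the query and keep the best k.
    top = sorted(cache, key=lambda key: cosine(query, key), reverse=True)[:k]
    # Softmax over similarities gives positive weights that sum to 1.
    exps = [math.exp(cosine(query, key)) for key in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the retrieved embeddings.
    retrieved = [
        sum(w * key[i] for w, key in zip(weights, top))
        for i in range(len(query))
    ]
    # Interpolate between the original query and the retrieved average.
    return [(1 - blend) * q + blend * r for q, r in zip(query, retrieved)]


# Toy cache of 2-D "image" embeddings and a query from another modality.
toy_cache = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
audio_query = [0.9, 0.1]
enhanced = cache_enhance(audio_query, toy_cache, k=2, blend=0.5)
```

With `blend=0.0` the query passes through unchanged, and with `blend=1.0` it is replaced entirely by the retrieved average, so the parameter controls how strongly the cache overrides the original embedding.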
Findings
Model responds effectively to diverse modalities.
Achieves superior multi-modality instruction-following.
Demonstrates high language generation quality across modalities.
Abstract
We present ImageBind-LLM, a multi-modality instruction tuning method for large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning. In contrast, ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic, using only image-text alignment training. During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder. The image features transformed by the bind network are then added to the word tokens of all layers in LLaMA, which progressively injects visual instructions via an attention-free and zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, this simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During…
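The injection step described in the abstract can be sketched in a few lines. This is a minimal toy version under stated assumptions, not the authors' implementation: the bind network is reduced to a single hypothetical linear projection, the transformer layers between injections are omitted, and the names `bind_network`, `inject`, and `run_layers` are invented for illustration. It shows why a zero-initialized gate leaves the language model untouched at the start of training, and why no attention is needed: the visual feature is simply added to each word token.

```python
import random


def bind_network(image_feature, weight):
    """Hypothetical stand-in for the bind network: one linear projection
    from the image-encoder embedding size to the LLM hidden size."""
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weight]


def inject(word_tokens, visual_feature, gate):
    """Add the gated visual feature to every word token (attention-free)."""
    return [
        [t + gate * v for t, v in zip(token, visual_feature)]
        for token in word_tokens
    ]


def run_layers(word_tokens, visual_feature, gates):
    """Apply the injection once per layer. Real transformer blocks would
    run between injections; they are left out of this sketch."""
    hidden = word_tokens
    for gate in gates:
        hidden = inject(hidden, visual_feature, gate)
    return hidden


random.seed(0)
embed_dim, hidden_dim, num_layers, seq_len = 4, 3, 2, 5  # toy sizes, assumed

image_feature = [random.gauss(0, 1) for _ in range(embed_dim)]
weight = [[random.gauss(0, 1) for _ in range(embed_dim)] for _ in range(hidden_dim)]
visual = bind_network(image_feature, weight)

tokens = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(seq_len)]

# One learnable gate per layer, initialized to zero.
zero_gates = [0.0] * num_layers
at_init = run_layers(tokens, visual, zero_gates)

# After training, the gates would have moved away from zero.
trained_gates = [0.5] * num_layers
after_training = run_layers(tokens, visual, trained_gates)
```

With all gates at zero, `at_init` equals `tokens`, so the pretrained model's behavior is preserved when training begins. As the gates grow, the visual signal is mixed in gradually, which matches the "progressively injects" wording in the abstract.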
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Focus · ALIGN
