TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, and Shuming Shi

TL;DR
TextBind is a novel framework that enables large language models to perform multi-turn interleaved multimodal instruction-following with minimal annotation, using only image-caption pairs to generate complex multimodal conversations.
Contribution
We introduce TextBind, an almost annotation-free method that leverages a new architecture and dataset to enhance multimodal instruction-following capabilities in large language models.
Findings
Generates multi-turn multimodal conversations from image-caption pairs alone, with almost no human annotation.
Introduces MIM, a language model-centric architecture for interleaved image-text inputs and outputs.
Provides a dataset, model, and demo to support future research in multimodal instruction-following.
Abstract
Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates…
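The data-construction idea in the abstract (a text-only language model writes a conversation about images it knows only through their captions) can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the paper or its released code: the `<imgN>` placeholder syntax, the prompt wording, and the names `Turn`, `build_prompt`, and `bind_images` are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str  # may contain hypothetical placeholders such as <img0>


def build_prompt(captions: list[str]) -> str:
    """Describe each image by its caption so a text-only LLM can refer to it."""
    lines = [f"<img{i}> {cap}" for i, cap in enumerate(captions)]
    return (
        "Write a multi-turn conversation between a user and an assistant "
        "that refers to the images below by their placeholders.\n"
        + "\n".join(lines)
    )


IMG_TAG = re.compile(r"<img(\d+)>")


def bind_images(turns: list[Turn], image_paths: list[str]):
    """Split each turn into text and image segments, swapping placeholders for images."""
    bound = []
    for turn in turns:
        segments = []
        pos = 0
        for m in IMG_TAG.finditer(turn.text):
            if m.start() > pos:
                segments.append(("text", turn.text[pos:m.start()]))
            segments.append(("image", image_paths[int(m.group(1))]))
            pos = m.end()
        if pos < len(turn.text):
            segments.append(("text", turn.text[pos:]))
        bound.append((turn.role, segments))
    return bound
```

In this sketch, `build_prompt` would be sent to a language model, and the generated turns would then be passed through `bind_images` to obtain interleaved image-text conversations. Only image-caption pairs are needed as input, which matches the abstract's description at a high level; the actual prompting and filtering used by the authors may differ.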
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Mutual Information Machine/Mask Image Modeling
