Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji

TL;DR
This paper introduces Mixture-of-Modality Adaptation (MMA), a lightweight method for adapting large language models to vision-language tasks that significantly reduces training cost while maintaining high performance.
Contribution
It proposes MMA, a novel adapter-based approach with a routing algorithm, enabling cost-effective multimodal adaptation of LLMs like LLaMA for vision-language tasks.
Findings
LaVIN, the model obtained by applying MMA to LLaMA, achieves competitive performance on multimodal science question answering.
LaVIN demonstrates superior training efficiency compared to existing multimodal LLMs.
LaVIN requires only 1.4 training hours with 3.8M trainable parameters.
Abstract
Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, which not only need to optimize excessive parameters, but also require another large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaption of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. Meanwhile, MMA is also equipped with a routing algorithm to help LLMs achieve an automatic…
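The adapter-plus-routing idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names (Adapter, RoutedAdapter), the two-path design, the mean-pooled softmax router, and all dimensions are assumptions chosen for illustration, and the code uses plain Python lists rather than a deep learning framework. It shows a frozen-backbone-style residual update in which a router mixes the outputs of several small bottleneck adapters.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class Adapter:
    """Bottleneck adapter: down-project, ReLU, up-project (hypothetical layout)."""

    def __init__(self, dim, bottleneck, rng):
        self.down = make_matrix(bottleneck, dim, rng)
        self.up = make_matrix(dim, bottleneck, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, hidden)

    def num_params(self):
        # Two weight matrices of size bottleneck x dim (biases omitted).
        return 2 * len(self.down) * len(self.up)


class RoutedAdapter:
    """Mixes several adapter paths with input-dependent routing weights."""

    def __init__(self, dim, bottleneck, num_paths=2, seed=0):
        rng = random.Random(seed)
        self.paths = [Adapter(dim, bottleneck, rng) for _ in range(num_paths)]
        self.router = make_matrix(num_paths, dim, rng)

    def route(self, tokens):
        # Assumption: the router sees the mean of all token features.
        dim = len(tokens[0])
        pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
        return softmax(matvec(self.router, pooled))

    def __call__(self, tokens):
        weights = self.route(tokens)
        out = []
        for t in tokens:
            mixed = [0.0] * len(t)
            for w, path in zip(weights, self.paths):
                mixed = [m + w * d for m, d in zip(mixed, path(t))]
            # Residual connection: the original features pass through unchanged.
            out.append([a + b for a, b in zip(t, mixed)])
        return out, weights


layer = RoutedAdapter(dim=8, bottleneck=2)
tokens = [[0.1 * (i + j) for j in range(8)] for i in range(4)]
out, weights = layer(tokens)
```

In this sketch only the adapter and router matrices would be trained, which is why the trainable parameter count stays small relative to the backbone: each adapter path here holds 2 x 8 x 2 = 32 weights.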
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
