Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Gen Luo; Yiyi Zhou; Tianhe Ren; Shengxin Chen; Xiaoshuai Sun; Rongrong Ji

arXiv:2305.15023·cs.CV·October 25, 2023·45 cites

TL;DR

This paper introduces Mixture-of-Modality Adaptation (MMA), a lightweight adapter-based method for adapting large language models to vision-language tasks. It cuts training cost while keeping performance competitive.

Contribution

It proposes MMA, a novel adapter-based approach with a routing algorithm, enabling cost-effective multimodal adaptation of LLMs like LLaMA for vision-language tasks.
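As a rough illustration of what "adapters with a routing algorithm" can look like, here is a minimal sketch in plain Python. It assumes two bottleneck adapters whose outputs are mixed by softmax weights selected per modality, then added back to the input as a residual. This is a generic pattern, not the paper's exact formulation, and every name here (`BottleneckAdapter`, `RoutedAdapter`, the modality ids) is hypothetical; see the paper and the repository for the actual design.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Down-project, ReLU, up-project: few parameters when hidden << dim."""

    def __init__(self, dim, hidden, rng):
        self.down = make_matrix(hidden, dim, rng)
        self.up = make_matrix(dim, hidden, rng)

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, h)


class RoutedAdapter:
    """Hypothetical routed adapter: mixes two adapter paths per modality."""

    def __init__(self, dim, hidden, num_modalities=2, seed=0):
        rng = random.Random(seed)
        self.adapters = [BottleneckAdapter(dim, hidden, rng) for _ in range(2)]
        # Assumption: one row of routing logits per modality id
        # (e.g. 0 = text-only input, 1 = image + text input).
        self.router_logits = make_matrix(num_modalities, 2, rng, scale=1.0)

    def route(self, modality):
        """Routing weights over the two adapter paths; they sum to 1."""
        return softmax(self.router_logits[modality])

    def __call__(self, x, modality):
        weights = self.route(modality)
        outs = [adapter(x) for adapter in self.adapters]
        mixed = [sum(w * o[i] for w, o in zip(weights, outs)) for i in range(len(x))]
        # Residual connection: the frozen model's feature plus the adapter update.
        return [xi + mi for xi, mi in zip(x, mixed)]


layer = RoutedAdapter(dim=4, hidden=2)
features = [1.0, -0.5, 0.25, 2.0]
text_out = layer(features, modality=0)
image_text_out = layer(features, modality=1)
```

In a setup like this only the adapter and router weights would be trained while the backbone stays frozen, which is the general reason adapter methods keep the trainable parameter count small.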

Findings

1. LaVIN, the model obtained by applying MMA to LLaMA, achieves competitive performance in multimodal science question answering.
2. LaVIN demonstrates superior training efficiency compared to existing multimodal LLMs.
3. LaVIN requires only 1.4 training hours with 3.8M trainable parameters.

Abstract

Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, which not only need to optimize excessive parameters, but also require another large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaption of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. Meanwhile, MMA is also equipped with a routing algorithm to help LLMs achieve an automatic…

Code & Models

Repositories

luogen1996/lavin
pytorch

Videos

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques