LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao; Jiaming Han; Renrui Zhang; Ziyi Lin; Shijie Geng; Aojun Zhou; Wei Zhang; Pan Lu; Conghui He; Xiangyu Yue; Hongsheng Li; Yu Qiao

arXiv:2304.15010 · cs.CV · May 1, 2023 · 117 cites

TL;DR

LLaMA-Adapter V2 is a parameter-efficient model that enhances multi-modal reasoning and instruction-following capabilities of LLaMA by unlocking more parameters, early visual token fusion, and joint training with image-text data.

Contribution

The paper introduces a parameter-efficient framework that improves visual instruction understanding and multi-modal reasoning while adding only about 14M parameters to LLaMA.

Findings

01. Outperforms the previous LLaMA-Adapter on open-ended multi-modal tasks

02. Achieves strong instruction-following with only 14M extra parameters

03. Enhances language-only instruction capabilities and chat interactions

Abstract

How to efficiently transform large language models (LLMs) into instruction followers has recently become a popular research direction, while training LLMs for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and…
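
The bias-and-scale unlocking mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as one plausible reading, that each frozen linear layer computes y = s * (W x + b), with b initialised to zero and s to one so that the layer starts out identical to the pretrained one; the abstract shown here does not spell out the formula. The class name `FrozenLinearWithScaleBias` and its methods are hypothetical, and the code is plain Python for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrozenLinearWithScaleBias:
    """Hypothetical illustration: frozen weights plus a trainable bias and scale."""

    weight: List[List[float]]                         # frozen, shape (out_dim, in_dim)
    bias: List[float] = field(default_factory=list)   # trainable, starts at 0
    scale: List[float] = field(default_factory=list)  # trainable, starts at 1

    def __post_init__(self) -> None:
        out_dim = len(self.weight)
        if not self.bias:
            self.bias = [0.0] * out_dim
        if not self.scale:
            self.scale = [1.0] * out_dim

    def forward(self, x: List[float]) -> List[float]:
        # Assumed form: y = s * (W x + b), applied per output channel.
        out = []
        for row, b, s in zip(self.weight, self.bias, self.scale):
            pre = sum(w * xi for w, xi in zip(row, x))
            out.append(s * (pre + b))
        return out

    def trainable_parameter_count(self) -> int:
        # Only the bias and scale vectors would receive gradient updates.
        return len(self.bias) + len(self.scale)

    def frozen_parameter_count(self) -> int:
        # The pretrained weight matrix stays fixed.
        return sum(len(row) for row in self.weight)


layer = FrozenLinearWithScaleBias(weight=[[1.0, 2.0], [3.0, 4.0]])
initial_output = layer.forward([1.0, 1.0])   # equals the plain W x at initialisation

# Simulate a tuning step that changes only the unlocked parameters.
layer.bias = [1.0, 0.0]
layer.scale = [2.0, 1.0]
tuned_output = layer.forward([1.0, 1.0])
```

Under this assumption, the layer reproduces the frozen output at initialisation, and for a weight of shape (out_dim, in_dim) only 2 * out_dim values are trainable, which is why this kind of unlocking adds so few parameters.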

Code & Models

Models

🤗 Alpha-VLLM/LLaMA2-Accessory · model · ♡ 38

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Adam · Layer Normalization · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection