LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao

TL;DR
LLaMA-Adapter V2 is a parameter-efficient model that improves LLaMA's multi-modal reasoning and instruction following by unlocking more learnable parameters, feeding visual tokens only into the early layers, and training jointly with image-text data.
Contribution
It introduces a novel, efficient framework that improves visual instruction understanding and multi-modal reasoning with minimal additional parameters.
Findings
Outperforms the original LLaMA-Adapter on open-ended multi-modal tasks
Achieves strong instruction-following with only 14M extra parameters
Enhances language-only instruction capabilities and chat interactions
Abstract
How to efficiently transform large language models (LLMs) into instruction followers has recently become a popular research direction, while training LLMs for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distributes the instruction-following ability across the entire LLaMA model rather than confining it to the adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and…
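The "bias and scale" unlocking mentioned in the abstract can be illustrated with a minimal sketch. It assumes a linear layer whose weight matrix stays frozen while a per-output bias (initialized to 0) and a per-output scale (initialized to 1) are the only trainable values, so that the layer computes `scale * (W x + bias)` and starts out identical to the frozen layer. The class name `ScaleBiasLinear` and its methods are hypothetical and written in plain Python for illustration; this is not the authors' implementation.

```python
from typing import List, Optional


class ScaleBiasLinear:
    """Hypothetical sketch: frozen weights plus a trainable bias and scale.

    Computes y = scale * (W @ x + bias), element-wise over the outputs.
    Assumption: bias starts at 0 and scale at 1, so the layer initially
    reproduces the frozen linear map exactly.
    """

    def __init__(self, weight: List[List[float]],
                 bias: Optional[List[float]] = None,
                 scale: Optional[List[float]] = None) -> None:
        out_dim = len(weight)
        self.weight = weight                      # frozen, shape (out, in)
        self.bias = bias if bias is not None else [0.0] * out_dim    # trainable
        self.scale = scale if scale is not None else [1.0] * out_dim  # trainable

    def forward(self, x: List[float]) -> List[float]:
        out = []
        for row, b, s in zip(self.weight, self.bias, self.scale):
            pre = sum(w * xi for w, xi in zip(row, x)) + b  # W x + bias
            out.append(s * pre)                             # apply scale
        return out

    def trainable_count(self) -> int:
        # Only bias and scale would receive gradient updates.
        return len(self.bias) + len(self.scale)

    def frozen_count(self) -> int:
        return sum(len(row) for row in self.weight)


layer = ScaleBiasLinear([[1.0, 2.0], [3.0, 4.0]])
identity_like = layer.forward([1.0, 1.0])   # same as the frozen layer at init

layer.bias = [1.0, -1.0]                    # pretend these were learned
layer.scale = [2.0, 0.5]
adapted = layer.forward([1.0, 1.0])
```

The trainable values grow with the output dimension only, while the frozen weights grow with input times output, which is why this kind of unlocking adds few parameters relative to the base model.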
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Adam · Layer Normalization · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection
