Adapting LLaMA Decoder to Vision Transformer
Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai Chen, and Ping Luo

TL;DR
This paper adapts the decoder-only Transformer architecture, originally designed for language, to vision tasks. The resulting model, iLLaMA, achieves competitive accuracy and efficiency in image recognition and segmentation.
Contribution
It introduces a novel adaptation of the LLaMA architecture for vision, including techniques to overcome training challenges, and demonstrates competitive performance.
Findings
iLLaMA achieves 75.1% ImageNet top-1 accuracy with 5.7M parameters.
Scaling to 310M parameters improves accuracy to 86.0%.
iLLaMA shows reliable properties like shape-texture bias and good transfer learning.
Abstract
This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue, causing network training to fail. We propose repositioning the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate the optimization behavior. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and…
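The two techniques named in the abstract can be illustrated with a small mask-only sketch. This is not the authors' implementation: the function names, the linear schedule, and the way the mask is blended are all assumptions chosen for illustration. It shows why a class token placed after the image tokens can see the whole sequence under a causal mask, and how a mask could move gradually from bidirectional to causal early in training.

```python
def causal_mask(n):
    """Lower-triangular mask: token i may attend to tokens 0..i."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]


def soft_mask(n, step, warmup_steps):
    """Blend from an all-ones (bidirectional) mask to a causal mask.

    Assumption: a linear schedule over `warmup_steps`; the paper's
    actual schedule and blending rule may differ.
    """
    if warmup_steps <= 0:
        alpha = 0.0
    else:
        alpha = max(0.0, 1.0 - step / warmup_steps)  # goes from 1 to 0
    causal = causal_mask(n)
    return [[alpha + (1.0 - alpha) * causal[i][j] for j in range(n)]
            for i in range(n)]


def visible_tokens(mask, idx):
    """Indices of the tokens that position `idx` can attend to."""
    return [j for j, w in enumerate(mask[idx]) if w > 0]


num_patches = 4          # hypothetical tiny sequence of image tokens
n = num_patches + 1      # plus one class token
mask = causal_mask(n)

# Class token at position 0 (standard ViT order): sees only itself.
cls_first_sees = visible_tokens(mask, 0)
# Class token at the last position (post-sequence): sees every token.
cls_last_sees = visible_tokens(mask, n - 1)

print(cls_first_sees)
print(cls_last_sees)
```

With the class token first, its row of the causal mask contains only the diagonal entry, so it cannot aggregate any image information. Moving it to the end gives it a full row. The soft mask starts as all ones at step 0 and equals the causal mask once `step` reaches `warmup_steps`.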
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: ALIGN · LLaMA
