Adapting LLaMA Decoder to Vision Transformer

Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai Chen, and Ping Luo

arXiv:2404.06773 · cs.CV · May 28, 2024 · 1 citation

TL;DR

This paper adapts the decoder-only Transformer architecture, originally designed for language modeling, to vision tasks. The resulting model, iLLaMA, achieves competitive accuracy and efficiency in image recognition and segmentation.

Contribution

It introduces an adaptation of the LLaMA architecture for vision, including techniques that overcome training challenges (a post-sequence class token and a soft causal mask), and demonstrates competitive performance.

Findings

01  iLLaMA achieves 75.1% ImageNet top-1 accuracy with 5.7M parameters.

02  Scaling to 310M parameters improves accuracy to 86.0%.

03  iLLaMA shows reliable properties like shape-texture bias and good transfer learning.

Abstract

This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step by step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue, which makes network training fail. To overcome this challenge, we propose repositioning the class token behind the image tokens with a post-sequence class token technique, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to ease optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and…
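The two techniques named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the linear schedule, and the blend between an all-ones mask and a causal mask are assumptions chosen for illustration. It shows why a class token placed first sees no image tokens under a causal mask while one placed last sees all of them, and how a mask might move from bidirectional to causal early in training.

```python
def causal_mask(n):
    """Lower-triangular mask: token i may attend to tokens 0..i."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]


def visible_image_tokens(mask, cls_index, image_indices):
    """Count the image tokens the class token may attend to under `mask`."""
    return sum(1 for j in image_indices if mask[cls_index][j] > 0)


def soft_mask(n, step, warmup_steps):
    """Blend from a bidirectional (all-ones) mask to a causal mask.

    Assumption: a linear schedule over `warmup_steps` (hypothetical name);
    the schedule used in the paper is not stated in the abstract.
    """
    alpha = max(0.0, 1.0 - step / warmup_steps)  # decays from 1 to 0
    causal = causal_mask(n)
    return [
        [alpha * 1.0 + (1.0 - alpha) * causal[i][j] for j in range(n)]
        for i in range(n)
    ]


num_patches = 4
n = num_patches + 1  # image tokens plus one class token
mask = causal_mask(n)

# Class token first (standard ViT order): index 0, image tokens at 1..n-1.
pre = visible_image_tokens(mask, 0, range(1, n))

# Class token last (post-sequence): index n-1, image tokens at 0..n-2.
post = visible_image_tokens(mask, n - 1, range(0, n - 1))

print("class token first sees", pre, "image tokens")
print("class token last sees", post, "image tokens")
```

With the class token first, its row of the causal mask covers only itself, so it cannot aggregate any image information. Moving it to the end gives it a full row. In this sketch the soft mask starts as all ones and equals the causal mask once `step` reaches `warmup_steps`.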

Code & Models

Repositories

techmonsterwang/illama (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: ALIGN · LLaMA