On The Application of Linear Attention in Multimodal Transformers
Armin Gerami, Seyedehanita Madani, Ramani Duraiswami

TL;DR
This paper explores replacing standard softmax attention, whose cost grows quadratically with sequence length, with Linear Attention in multimodal Transformers. The swap brings attention cost down to linear in sequence length while keeping performance competitive on large-scale vision-language tasks.
Contribution
The paper demonstrates that Linear Attention can be integrated into multimodal Transformers as a scalable, efficient alternative to softmax attention while keeping accuracy competitive.
Findings
Linear Attention reduces the computational complexity of attention from quadratic to linear in sequence length.
Zero-shot accuracy on ImageNet-21K remains competitive with softmax attention.
Linear Attention follows the same scaling laws as softmax attention.
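To make the quadratic-to-linear claim concrete, here is a minimal NumPy sketch of non-causal linear attention. It is an illustration under stated assumptions, not the paper's implementation: the feature map elu(x) + 1 is one common choice and is assumed here, and the function names and tensor sizes are hypothetical. The idea is that once the softmax is replaced by a kernel feature map, the product can be regrouped as phi(Q) (phi(K)^T V), so the N x N attention matrix is never formed.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map (an assumed choice, not taken from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Non-causal linear attention: O(N * d * d_v) time, no N x N matrix."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v            # (d, d_v) summary of all keys and values
    z = k.sum(axis=0)       # (d,) normaliser shared by every query
    return (q @ kv) / (q @ z)[:, None]

def quadratic_reference(q, k, v):
    """Same kernelised attention, computed the O(N^2) way for comparison."""
    q, k = feature_map(q), feature_map(k)
    scores = q @ k.T        # (N, N) explicit attention matrix
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 196, 64              # hypothetical sizes: 196 tokens, head dimension 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_linear = linear_attention(q, k, v)
out_quadratic = quadratic_reference(q, k, v)
print(np.allclose(out_linear, out_quadratic))
```

Both functions compute the same output; only the order of the matrix products differs. In the linear version the intermediate `kv` has shape (d, d_v) regardless of the number of tokens, which is why cost grows linearly with sequence length.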
Abstract
Multimodal Transformers serve as the backbone for state-of-the-art vision-language models, yet their quadratic attention complexity remains a critical barrier to scalability. In this work, we investigate the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. By integrating LA, we reduce the computational overhead from quadratic to linear relative to sequence length while preserving competitive performance. We evaluate our approach across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, with validation focused on ImageNet-21K zero-shot accuracy. Our systematic evaluation demonstrates that Linear Attention not only yields significant computational savings but also adheres to the same scaling laws as standard softmax attention. These findings position Linear Attention as a robust, scalable solution for…
