SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency
Qianhao Yuan, Yanjiang Liu, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun

TL;DR
SAISA is an architecture for multimodal large language models that improves both training and inference efficiency by eliminating redundant attention among visual tokens, reducing inference FLOPs by 66% and training cost by 26%.
Contribution
The paper proposes SAISA, a new architecture that improves efficiency by aligning visual features directly with the input space, and introduces NAAViT, a self-attention mechanism that removes attention among visual tokens.
Findings
A pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant.
SAISA reduces inference FLOPs by 66%.
SAISA cuts training costs by 26%.
Abstract
Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. In this paper, we compare these two architectures and identify the key factors for building efficient MLLMs. A primary difference between them lies in how attention is applied to visual tokens, particularly in their interactions with each other. To investigate whether attention among visual tokens is necessary, we propose a new self-attention mechanism, NAAViT (No Attention Among Visual Tokens), which eliminates this type of attention. Our pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant. Based on…
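The idea of removing attention among visual tokens can be illustrated with an attention mask. The sketch below is not the authors' implementation: it assumes a causal decoder, assumes that each visual token may still attend to itself (so that no softmax row is empty), and uses hypothetical names (`naavit_mask`, `masked_softmax`, `count_allowed`) that do not come from the paper.

```python
import math

def naavit_mask(is_visual):
    """Return mask[i][j] == True when query i may attend to key j.

    Assumptions (not taken from the paper): the mask is causal, and a
    visual token keeps attention to itself so every row has one entry.
    """
    n = len(is_visual)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i
            # Block pairs where query and key are two different visual tokens.
            visual_pair = is_visual[i] and is_visual[j] and i != j
            row.append(causal and not visual_pair)
        mask.append(row)
    return mask

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; masked positions get weight 0."""
    top = max(s for s, m in zip(scores, mask_row) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

def count_allowed(mask):
    """Number of query-key pairs the mask leaves active."""
    return sum(sum(row) for row in mask)

# One text token, three visual tokens, one text token.
is_visual = [False, True, True, True, False]
mask = naavit_mask(is_visual)
weights = masked_softmax([0.5, 1.0, 2.0, 0.1, 0.0], mask[3])
```

For this five-token example, a plain causal mask allows 15 query-key pairs and the mask above allows 12; the three removed pairs are exactly the visual-to-visual ones. The text token at the end still attends to every earlier token, visual or not.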
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need
