SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency

Qianhao Yuan; Yanjiang Liu; Yaojie Lu; Hongyu Lin; Ben He; Xianpei Han; Le Sun

arXiv:2502.02458·cs.CL·February 5, 2025

TL;DR

SAISA is an architecture for multimodal large language models that improves both training and inference efficiency by eliminating redundant attention among visual tokens. It reduces inference FLOPs by 66% and training costs by 26%.

Contribution

The paper introduces NAAViT, a self-attention mechanism that removes attention among visual tokens, and proposes SAISA, a new architecture that improves efficiency by aligning visual features directly with the input space.

Findings

01

Redundant attention among visual tokens identified.

02

SAISA reduces inference FLOPs by 66%.

03

SAISA cuts training costs by 26%.

Abstract

Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. In this paper, we compare these two architectures and identify the key factors for building efficient MLLMs. A primary difference between them lies in how attention is applied to visual tokens, particularly in their interactions with each other. To investigate whether attention among visual tokens is necessary, we propose a new self-attention mechanism, NAAViT (No Attention Among Visual Tokens), which eliminates this type of attention. Our pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant. Based on…
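The abstract states only that NAAViT eliminates attention among visual tokens. The following is a minimal sketch of what such an attention mask could look like, not the authors' implementation. The function names (`causal_mask`, `naavit_style_mask`, `masked_softmax`), the toy sequence layout, and the choice to let each visual token still attend to itself are all assumptions made for illustration.

```python
import numpy as np

def causal_mask(n_tokens):
    """Standard causal mask: position i may attend to positions j <= i."""
    return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))

def naavit_style_mask(is_visual):
    """Causal mask with visual-to-visual attention removed.

    is_visual: 1-D boolean array, True where the token is a visual token.
    Assumption (not from the source): each visual token keeps attention to
    itself so that every softmax row has at least one allowed entry.
    """
    is_visual = np.asarray(is_visual, dtype=bool)
    n = is_visual.shape[0]
    mask = causal_mask(n)
    # Pairs where both the query and the key are visual tokens.
    visual_pairs = np.outer(is_visual, is_visual)
    # Drop those pairs, except the diagonal (a token attending to itself).
    mask &= ~(visual_pairs & ~np.eye(n, dtype=bool))
    return mask

def masked_softmax(scores, mask):
    """Row-wise softmax over scores; disallowed entries get zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy sequence (hypothetical): 4 visual tokens followed by 3 text tokens.
is_visual = np.array([True] * 4 + [False] * 3)
mask = naavit_style_mask(is_visual)
full = causal_mask(len(is_visual))

# Random scores stand in for the query-key products of a real model.
rng = np.random.default_rng(0)
weights = masked_softmax(rng.normal(size=mask.shape), mask)

print(mask.astype(int))
print("allowed pairs:", int(mask.sum()), "of", int(full.sum()))
```

In this toy layout the mask keeps 22 of the 28 causal query-key pairs. Text tokens still attend to every earlier visual and text token, so only the visual-to-visual interactions are removed. The share of removed pairs grows with the number of visual tokens, but this count is an illustration only and is not how the paper's FLOPs figures were derived.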

Code & Models

Repositories

icip-cas/saisa (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need