ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention
Wenjie Liu, Hao Wu, Xin Qiu, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

TL;DR
ViCA is a minimal multimodal LLM architecture in which visual tokens interact with text only through sparse cross-attention at selected layers. It preserves 98% of baseline accuracy while cutting visual-side computation to 4% and delivering over 3.5x speedup in single-batch inference.
Contribution
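The paper proposes ViCA, an architecture in which visual tokens bypass all self-attention and feed-forward layers and reach the text stream only through sparse cross-attention at selected layers, yielding more efficient multimodal LLMs with minimal accuracy loss.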
Findings
Preserves 98% of baseline accuracy
Reduces visual-side computation to 4% of the baseline
Achieves over 3.5x speedup in single-batch inference
Abstract
Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side…
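The design described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single-head attention, the ReLU feed-forward block, the absence of normalization, the dimensions, and the choice of cross-attention layers (`cross_layers = {2, 5}`) are all assumptions made for illustration. The sketch only shows the structural idea that the visual embeddings are never updated by any layer, while the text stream reads from them through cross-attention at a few selected layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention, single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def init_weight(rng, d):
    return rng.normal(scale=d ** -0.5, size=(d, d))

class Layer:
    """Toy decoder layer (hypothetical): text self-attention and FFN,
    plus optional cross-attention from text to the fixed visual embeddings."""

    def __init__(self, d, rng, use_cross_attn):
        self.wq, self.wk, self.wv = (init_weight(rng, d) for _ in range(3))
        self.w1, self.w2 = init_weight(rng, d), init_weight(rng, d)
        self.use_cross_attn = use_cross_attn
        if use_cross_attn:
            self.cq, self.ck, self.cv = (init_weight(rng, d) for _ in range(3))

    def __call__(self, text, visual):
        n = text.shape[0]
        causal = np.tril(np.ones((n, n), dtype=bool))
        # Self-attention runs over text tokens only; visual tokens are excluded.
        text = text + attention(text @ self.wq, text @ self.wk, text @ self.wv, causal)
        if self.use_cross_attn:
            # Text queries attend to the projected visual embeddings.
            text = text + attention(text @ self.cq, visual @ self.ck, visual @ self.cv)
        # Feed-forward block is applied to text tokens only.
        return text + np.maximum(text @ self.w1, 0.0) @ self.w2

def forward(text, visual, layers):
    # The visual embeddings are passed through unchanged to every layer.
    for layer in layers:
        text = layer(text, visual)
    return text

rng = np.random.default_rng(0)
d, n_text, n_vis, n_layers = 16, 5, 32, 8  # assumed toy sizes
cross_layers = {2, 5}                      # assumed; the selection rule is not given here
layers = [Layer(d, rng, i in cross_layers) for i in range(n_layers)]
text = rng.normal(size=(n_text, d))
visual = rng.normal(size=(n_vis, d))
out = forward(text, visual, layers)
print(out.shape)
```

In this toy setup the per-layer cost that scales with the number of visual tokens appears only in the layers listed in `cross_layers`; every other layer processes the text tokens alone.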
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications
