ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

Wenjie Liu; Hao Wu; Xin Qiu; Yingqi Fan; Yihan Zhang; Anhao Zhao; Yunpu Ma; Xiaoyu Shen

arXiv:2602.07574 · cs.CV · February 10, 2026

Open Access

TL;DR

ViCA is a minimal multimodal LLM architecture that replaces dense visual processing with sparse cross-attention, preserving 98% of baseline accuracy while cutting visual-side computation to 4% and speeding up single-batch inference by over 3.5x.

Contribution

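The paper shows that projected visual embeddings are already well-aligned with the language space and that effective vision-language interaction occurs in only a small subset of layers. Building on this, it proposes ViCA, an architecture in which visual tokens bypass all self-attention and feed-forward layers and interact with text only through sparse cross-attention at selected layers.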

Findings

01 Preserves 98% of baseline accuracy

02 Reduces visual-side computation to 4%

03 Achieves over 3.5x speedup in single-batch inference

Abstract

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side…
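The design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the absence of normalization layers, the ReLU feed-forward block, the placement of cross-attention after text self-attention, and every name here (`init_layer`, `vica_forward`, the weight keys) are assumptions made for illustration. What the sketch does take from the abstract is that the visual embeddings are never updated by any layer, and that text reads from them only at a chosen subset of layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    # Single-head scaled dot-product attention (simplification).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def init_layer(rng, d, with_cross):
    # Hypothetical parameter layout; cross-attention weights exist
    # only in the layers selected for vision-language interaction.
    def w():
        return rng.normal(scale=d ** -0.5, size=(d, d))
    layer = {"wq": w(), "wk": w(), "wv": w(), "wo": w(), "w1": w(), "w2": w()}
    if with_cross:
        layer.update({"cq": w(), "ck": w(), "cv": w(), "co": w()})
    return layer

def vica_forward(text, visual, layers):
    # text: (T, d) text hidden states; visual: (V, d) projected visual embeddings.
    h = text
    for p in layers:
        # Causal self-attention over text tokens only.
        h = h + attention(h @ p["wq"], h @ p["wk"], h @ p["wv"], causal=True) @ p["wo"]
        if "cq" in p:
            # Sparse cross-attention: text queries, visual keys and values.
            h = h + attention(h @ p["cq"], visual @ p["ck"], visual @ p["cv"]) @ p["co"]
        # Feed-forward block applied to text tokens only.
        h = h + np.maximum(h @ p["w1"], 0.0) @ p["w2"]
        # `visual` is never modified: it bypasses self-attention and FFN.
    return h
```

For example, with four layers and cross-attention at layers 1 and 3 (an arbitrary choice for the sketch), `layers = [init_layer(rng, d, i in {1, 3}) for i in range(4)]` followed by `vica_forward(text, visual, layers)` returns text hidden states of shape `(T, d)`. The per-layer cost of the visual side then reduces to the key and value projections and the text-to-vision attention at the selected layers.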

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications