RAVE: Re-Allocating Visual Attention in Large Multimodal Models

Xi Leng; Xinhong Ma; Ziqiang Dong; Feng Zhang; Xiaoying Tang; Yang Yang; Guanjun Jiang

arXiv:2605.18359·cs.CV·May 19, 2026

TL;DR

RAVE is a lightweight attention re-allocation method for large multimodal models that improves visual grounding and task performance without altering the model architecture.

Contribution

RAVE introduces a pair-gating mechanism that adds a learned query-key bias to the pre-softmax attention scores over visual keys. It is trained end-to-end with the rest of the model and requires no architectural changes to the backbone.

Findings

01 RAVE improves over standard attention by an average of 3 points across a suite of multimodal benchmarks.

02 The largest gains are on perception-intensive tasks: multilingual OCR, chart understanding, document VQA, and scene text VQA.

03 RAVE enhances visual grounding accuracy across diverse tasks.

Abstract

Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evidence and intra-visual imbalance among visual tokens. We propose RAVE (Re-Allocating Visual Attention), a lightweight pair-gating mechanism that adds a learned query-key bias to pre-softmax attention scores over visual keys, derived from pre-RoPE query and key features. RAVE requires no architectural modification to the backbone and can be trained end-to-end with the rest of the model. Across a suite of multimodal benchmarks, RAVE improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks (including multilingual OCR, chart understanding, document VQA, and scene text VQA), where accurate visual grounding is…
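
The abstract describes the mechanism only at a high level, so the following is a minimal single-head sketch of the general idea, not the authors' implementation. It assumes the bias is a low-rank bilinear form of the pre-RoPE query and key features; the names `pair_gate_bias`, `W_q`, and `W_k`, the bilinear parameterization, and the scaling are all hypothetical choices made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pair_gate_bias(q_pre, k_pre, W_q, W_k):
    # ASSUMPTION: a bilinear gate (W_q q) . (W_k k) on pre-RoPE features.
    # The paper's actual parameterization is not given in the abstract.
    return dot(matvec(W_q, q_pre), matvec(W_k, k_pre))

def attention_weights(q_rot, ks_rot, q_pre, ks_pre, is_visual, W_q, W_k):
    """Attention weights for one query over all keys.

    q_rot / ks_rot: query and keys after positional rotation (used for the
    standard scaled dot-product score).
    q_pre / ks_pre: the same features before rotation (used for the bias).
    is_visual: one bool per key; the bias is added only for visual keys.
    """
    scale = math.sqrt(len(q_rot))
    scores = []
    for k_rot, k_pre, vis in zip(ks_rot, ks_pre, is_visual):
        s = dot(q_rot, k_rot) / scale
        if vis:
            s += pair_gate_bias(q_pre, k_pre, W_q, W_k)  # added pre-softmax
        scores.append(s)
    return softmax(scores)

# Toy example: one text key and one visual key with identical features.
# For simplicity the rotated and pre-rotation features are taken to be equal.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
is_visual = [False, True]
zero = [[0.0, 0.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]

baseline = attention_weights(q, keys, q, keys, is_visual, zero, zero)
gated = attention_weights(q, keys, q, keys, is_visual, eye, eye)
```

With the gate weights set to zero the sketch reduces to standard attention, which is one plausible way such a module could leave a pretrained backbone's behavior unchanged at initialization; with a non-zero gate, attention mass shifts toward the visual key.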
