Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies
Peng Gao, Yujian Lee, Xiaofeng Zhang, Zailong Chen, Hui Zhang

TL;DR
This paper introduces a three-step inference-only strategy to mitigate attention decay in LVLMs caused by Rotary Positional Encoding, significantly improving long-range dependency modeling without additional training.
Contribution
It proposes T-DRS, a novel inference-only approach with three modules to recover long-range token dependencies in LVLMs, enhancing global context understanding.
Findings
Consistently improves VQA benchmark performance
Enhances long-range dependency modeling without retraining
Effective across multiple multimodal tasks
Abstract
Large Vision-Language Models (LVLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they still face critical challenges in modeling long-range dependencies when using Rotary Positional Encoding (RoPE). Although RoPE facilitates precise modeling of token positions, it induces progressive attention decay as token distance increases, especially over distant token pairs, which severely impairs the model's ability to remember global context. To alleviate this issue, we propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising (1) Semantic-Driven DRS (SD-DRS), which amplifies semantically meaningful but distant signals via content-aware residuals, (2) Distance-aware Control DRS (DC-DRS), which purifies attention by smoothly modulating weights based on positional distances, suppressing noise…
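The attention decay that the abstract attributes to RoPE can be illustrated with a small numerical sketch. It is not part of T-DRS and does not reproduce the paper's method; it only rotates a query and a key with standard RoPE and compares their raw dot-product score at short and long relative distances. The all-ones vectors, the head dimension of 128, the base of 10000, and the helper names `rope_rotate` and `rope_score` are assumptions chosen for illustration.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary positional encoding to a 1-D vector x (even length) at position pos."""
    d = x.shape[-1]
    half = d // 2
    # One rotation frequency per 2-D pair: theta_i = base^(-2i/d)
    freqs = base ** (-np.arange(half) * 2.0 / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def rope_score(distance, d=128):
    """Raw attention logit between all-ones q and k separated by `distance` tokens (illustrative assumption)."""
    q = np.ones(d)
    k = np.ones(d)
    return float(rope_rotate(q, distance) @ rope_rotate(k, 0))

# Scores as a function of relative distance
scores = np.array([rope_score(r) for r in range(2048)])

# Average magnitude close to the query versus far away from it
near = np.mean(np.abs(scores[1:9]))
far = np.mean(np.abs(scores[1000:1064]))
print(f"near-range mean |score|: {near:.2f}")
print(f"far-range  mean |score|: {far:.2f}")
```

In this toy setting the far-range average falls well below the near-range one, which is the kind of long-range decay the abstract describes. Real query and key vectors are not constant, so the exact curve in a trained LVLM will differ.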
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
