Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding
Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding, Ying Wang, Houwen Peng, Shiming Xiang

TL;DR
This paper introduces DIPE, a position encoding method that keeps multimodal models visually grounded over long contexts by removing the distance-dependent penalty on attention between text and visual tokens.
Contribution
The paper proposes DIPE, a novel position encoding technique that disentangles intra- and inter-modal interactions to mitigate visual fading in long-context multimodal models.
Findings
DIPE effectively preserves visual signals over long contexts.
Integrating DIPE with Multimodal RoPE improves long-term visual grounding.
The method maintains performance on short-context benchmarks.
Abstract
Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal…
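The disentangling idea in the abstract can be illustrated with a toy relative-distance matrix. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses plain 1-D token positions rather than the multi-axis positions of Multimodal RoPE, and the function names and the `anchor` parameter are hypothetical names chosen for illustration. Same-modality token pairs keep their ordinary relative distance, while cross-modality pairs are assigned one fixed distance.

```python
import numpy as np


def standard_relative_distance(num_tokens):
    """Relative distance i - j between every query i and key j."""
    pos = np.arange(num_tokens)
    return pos[:, None] - pos[None, :]


def disentangled_relative_distance(is_visual, anchor=1):
    """Toy modality-aware distance matrix (illustrative assumption only).

    Pairs of tokens from the same modality keep i - j.
    Pairs from different modalities all get the constant `anchor`,
    so their distance no longer grows with sequence length.
    """
    is_visual = np.asarray(is_visual, dtype=bool)
    rel = standard_relative_distance(len(is_visual))
    same_modality = is_visual[:, None] == is_visual[None, :]
    return np.where(same_modality, rel, anchor)


# Toy sequence: 3 visual tokens followed by 5 text tokens.
is_visual = [True] * 3 + [False] * 5
standard = standard_relative_distance(len(is_visual))
disentangled = disentangled_relative_distance(is_visual, anchor=1)
```

In this toy sequence, the last text token (index 7) sits at distance 7 from the first visual token under the standard scheme, and that gap keeps growing as more text is generated. In the sketch it stays at the anchor value for every text-visual pair, while text-text and visual-visual distances are unchanged.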
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Face recognition and analysis
