ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
Bozhou Li, Wentao Zhang

TL;DR
ID-Align is a position remapping technique for vision-language models that reorders position IDs so that high-resolution image tokens inherit the IDs of their corresponding thumbnail tokens. Within the LLaVA-Next framework, it reports gains across multiple benchmarks, including a 6.09% improvement on MMBench relation reasoning tasks.
Contribution
The paper proposes ID-Align, a method that reorders position IDs to strengthen the interaction between high-resolution and thumbnail tokens in VLMs, countering the long-term decay property of RoPE.
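To make the long-term decay property concrete, the following is a minimal sketch, not taken from the paper. It assumes the standard RoPE frequency schedule (base 10000) and uses an all-ones query and key, so the attention logit depends only on the relative distance between the two positions. The function name `rope_score` and the chosen distances are hypothetical and for illustration only.

```python
import math

def rope_score(distance, dim=64, base=10000.0):
    """Attention logit between an all-ones query and an all-ones key
    under RoPE, as a function of their relative position distance."""
    total = 0.0
    for i in range(dim // 2):
        # Standard RoPE frequency for the i-th 2-D pair (assumed schedule).
        theta = base ** (-2.0 * i / dim)
        # Two identical 2-D vectors rotated apart by distance * theta
        # have dot product |q| * |k| * cos(angle) = 2 * cos(distance * theta).
        total += 2.0 * math.cos(distance * theta)
    return total

# Print the logit at a few relative distances. The trend is downward with
# oscillation, so tokens whose position IDs are far apart tend to interact less.
for d in (0, 16, 256, 2048):
    print(d, round(rope_score(d), 2))
```

With many image tokens placed between a thumbnail token and its high-resolution counterpart, sequential position IDs push the pair far apart, which is the regime where this logit is small.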
Findings
Achieves a 6.09% improvement on MMBench relation reasoning tasks within the LLaVA-Next framework.
Demonstrates significant gains across multiple benchmarks.
Effectively enhances high-resolution image understanding in VLMs.
Abstract
Currently, a prevalent approach for enhancing the performance of Vision-Language Models (VLMs) is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When this approach is combined with the widely used Rotary Position Embedding (RoPE), the long-term decay property of RoPE hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail tokens, and the overexpansion of positional indices is constrained. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning…
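The ID inheritance described in the abstract can be sketched roughly as follows. This is an illustrative guess, not the authors' implementation: the function name `id_align_position_ids`, the row-major thumbnail numbering, the nearest-cell mapping from high-resolution grid to thumbnail grid, and the rule that the next token continues right after the thumbnail's last ID are all assumptions made for this example.

```python
def id_align_position_ids(start, thumb_h, thumb_w, hi_h, hi_w):
    """Return (thumbnail_ids, high_res_ids, next_id) for one image.

    Hypothetical sketch: thumbnail tokens get sequential IDs beginning at
    `start`; each high-resolution token reuses the ID of the thumbnail
    token covering the same image region.
    """
    # Thumbnail tokens: ordinary sequential IDs in row-major order.
    thumb_ids = [
        [start + r * thumb_w + c for c in range(thumb_w)]
        for r in range(thumb_h)
    ]
    # High-resolution tokens: map (r, c) to the covering thumbnail cell
    # by integer scaling, then inherit that cell's ID.
    hi_ids = []
    for r in range(hi_h):
        tr = r * thumb_h // hi_h
        hi_ids.append(
            [thumb_ids[tr][c * thumb_w // hi_w] for c in range(hi_w)]
        )
    # Assumption: later tokens continue after the thumbnail's last ID,
    # so high-resolution tokens do not inflate the position range.
    next_id = start + thumb_h * thumb_w
    return thumb_ids, hi_ids, next_id


thumb, hi, next_id = id_align_position_ids(start=5, thumb_h=2, thumb_w=2, hi_h=4, hi_w=4)
for row in hi:
    print(row)
print("next position ID:", next_id)
```

In this toy case the 16 high-resolution tokens share the 4 thumbnail IDs, so the position range grows by 4 instead of the 20 that purely sequential numbering of both grids would consume.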
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Subtitles and Audiovisual Media
