ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models

Bozhou Li; Wentao Zhang

arXiv:2505.21465·cs.CV·May 28, 2025

Open Access

TL;DR

ID-Align is a position remapping technique for vision-language models: high-resolution image tokens inherit the position IDs of their corresponding thumbnail tokens, which counteracts RoPE's long-term decay and yields gains on several benchmarks, including 6.09% on MMBench's relation reasoning tasks.

Contribution

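The paper proposes ID-Align, a method that reorders position IDs so that high-resolution tokens share IDs with their corresponding thumbnail tokens. This strengthens the interaction between the two token sets in VLMs, which the long-term decay of Rotary Position Embedding (RoPE) would otherwise weaken.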
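To make the decay issue concrete, here is a minimal sketch of how a RoPE attention logit tends to shrink as two tokens move apart. It is an illustration only, not code from the paper: the all-ones query and key, the head dimension of 64, the base of 10000, and the helper names `rope_score` and `mean_score` are all assumptions chosen for the example.

```python
import math

def rope_score(distance, dim=64, base=10000.0):
    """Attention logit between an all-ones query and an all-ones key after
    RoPE, when the two tokens are `distance` positions apart (hypothetical
    helper for illustration)."""
    # RoPE rotates each 2-D pair of features by distance * theta_i, so the
    # pair (1, 1) contributes |(1, 1)|^2 * cos(distance * theta_i)
    # = 2 * cos(distance * theta_i) to the dot product.
    return sum(
        2.0 * math.cos(distance * base ** (-2.0 * i / dim))
        for i in range(dim // 2)
    )

def mean_score(start, stop):
    """Average logit over relative distances in [start, stop)."""
    return sum(rope_score(d) for d in range(start, stop)) / (stop - start)

near = mean_score(1, 51)       # tokens a few dozen positions apart
far = mean_score(2000, 2050)   # tokens thousands of positions apart

print(f"distance 0:      {rope_score(0):.2f}")
print(f"mean, 1..50:     {near:.2f}")
print(f"mean, 2000..2049: {far:.2f}")
```

The logit is not monotonic in distance, but averaged over a window it is lower for distant tokens than for nearby ones. When an image expands into thousands of tokens, thumbnail tokens, high-resolution tokens, and the text that follows can end up this far apart in position space.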

Findings

01

Achieves a 6.09% improvement on MMBench's relation reasoning tasks in experiments within the LLaVA-Next framework.

02

Demonstrates significant gains across multiple benchmarks.

03

Improves the interaction between high-resolution and thumbnail tokens, and between text and image, by limiting the overexpansion of positional indices.

Abstract

Currently, a prevalent approach for enhancing the performance of Vision-Language Models (VLMs) is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When this is combined with the widely used Rotary Position Embedding (RoPE), the long-term decay property of RoPE hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail tokens, which also constrains the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning…
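The remapping described in the abstract can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the token layout (text prefix, then thumbnail tokens, then high-resolution tokens, then text suffix), the nearest-cell correspondence between the two grids, the absence of separator tokens, and the function name `id_align_position_ids` are all assumptions made for the example.

```python
def id_align_position_ids(n_prefix, thumb_hw, hires_hw, n_suffix):
    """Return one position ID per token for the assumed layout
    [text prefix | thumbnail grid | high-res grid | text suffix]."""
    th, tw = thumb_hw   # thumbnail grid height and width, in tokens
    hh, hw = hires_hw   # high-resolution grid height and width, in tokens

    # Text before the image keeps ordinary sequential IDs.
    ids = list(range(n_prefix))

    # Thumbnail tokens continue the sequence in row-major order.
    thumb_ids = [
        [n_prefix + r * tw + c for c in range(tw)] for r in range(th)
    ]
    for row in thumb_ids:
        ids.extend(row)

    # Each high-res token inherits the ID of the thumbnail token that covers
    # the same image region (assumed: proportional index mapping).
    for r in range(hh):
        tr = r * th // hh
        for c in range(hw):
            tc = c * tw // hw
            ids.append(thumb_ids[tr][tc])

    # Text after the image resumes right after the last thumbnail ID, so the
    # high-res tokens do not inflate the position range.
    next_id = n_prefix + th * tw
    ids.extend(range(next_id, next_id + n_suffix))
    return ids

ids = id_align_position_ids(n_prefix=2, thumb_hw=(2, 2), hires_hw=(4, 4), n_suffix=3)
print(ids)
print("max ID:", max(ids), "vs sequential max:", len(ids) - 1)
```

In this toy case there are 25 tokens, but the largest position ID is 8 instead of 24, so the text after the image sits much closer in position space to both the thumbnail and the high-resolution tokens.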

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Subtitles and Audiovisual Media