TL;DR
This paper systematically analyzes multimodal Rotary Positional Embedding (RoPE) in vision-language models and proposes improved variants that enhance multimodal understanding without architectural changes.
Contribution
It introduces two multimodal RoPE variants, Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), derived from three design guidelines and reported to improve performance across diverse benchmarks.
Findings
The proposed variants consistently outperform existing approaches across diverse benchmarks.
Three key guidelines are identified: positional coherence, full frequency utilization, and preservation of textual priors.
Significant improvements are reported in both general and fine-grained multimodal understanding tasks.
Abstract
Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, ensuring unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal…
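To make the two components named in the abstract concrete, here is a minimal NumPy sketch of position design (each token carries a position per axis) and frequency allocation (which axis drives which rotary frequency). It is not the paper's implementation. The function names, the three-axis (time, height, width) layout, and the round-robin reading of "interleave" are all assumptions made for illustration. The contiguous-block allocation is included only as a contrast.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE: one frequency per pair of channels, from high to low.
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def interleaved_axis_assignment(num_freqs, num_axes=3):
    # Hypothetical round-robin allocation: frequency i is driven by axis i % num_axes,
    # so every axis sees both high and low frequencies.
    return np.arange(num_freqs) % num_axes

def chunked_axis_assignment(num_freqs, num_axes=3):
    # Contrast case (also an assumption): each axis owns one contiguous frequency block.
    return np.arange(num_freqs) * num_axes // num_freqs

def rotation_angles(positions, head_dim, assignment):
    # positions: (seq_len, num_axes) array, e.g. columns (t, h, w).
    freqs = rope_frequencies(head_dim)
    pos_per_freq = positions[:, assignment]  # (seq_len, num_freqs)
    return pos_per_freq * freqs

def apply_rope(x, angles):
    # x: (seq_len, head_dim); channel pairs are (x[2i], x[2i+1]).
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Text tokens: the same index on every axis (an assumed convention).
text_pos = np.stack([np.arange(5)] * 3, axis=1)
angles = rotation_angles(text_pos, 16, interleaved_axis_assignment(8))
rotated = apply_rope(np.ones((5, 16)), angles)
```

Under the assumed convention that text tokens use the same index on every axis, any allocation reduces to ordinary 1D RoPE for text. That is one way to read the "preservation of textual priors" guideline. The allocations differ only for image or video tokens, whose axes carry different positions.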
