Revisiting Multimodal Positional Encoding in Vision-Language Models

Jie Huang; Xuejing Liu; Sibo Song; Ruibing Hou; Hong Chang; Junyang Lin; Shuai Bai

arXiv:2510.23095 · cs.CV · April 7, 2026

TL;DR

This paper systematically analyzes multimodal Rotary Positional Embedding (RoPE) in vision-language models along two components, position design and frequency allocation, and proposes two plug-and-play variants that improve multimodal understanding without architectural changes.

Contribution

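The paper introduces two multimodal RoPE variants, Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), both designed around three guidelines: positional coherence, full frequency utilization, and preservation of textual priors.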

Findings

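01. MHRoPE and MRoPE-I consistently outperform existing multimodal RoPE methods across diverse benchmarks.

02. Three design guidelines emerge: positional coherence (unambiguous layout), full frequency utilization (rich representation), and preservation of textual priors (faithful transfer from the pre-trained LLM).

03. The reported gains are significant on both general and fine-grained multimodal tasks.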

Abstract

Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, ensuring unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal…
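To make "frequency allocation" concrete, here is a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes three position axes (time, height, width), a standard RoPE frequency schedule with base 10000, and that "interleave" means assigning frequency pairs to axes round-robin. All function names are hypothetical. The sketch contrasts a chunked allocation, where each axis gets one contiguous band of frequencies, with an interleaved one, where each axis is spread over the whole spectrum.

```python
import numpy as np


def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per rotated channel pair."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)


def chunked_axis_ids(num_pairs, sections):
    """Assumed baseline: contiguous blocks of pairs per axis (0=t, 1=h, 2=w)."""
    assert sum(sections) == num_pairs
    return np.repeat(np.arange(len(sections)), sections)


def interleaved_axis_ids(num_pairs, num_axes=3):
    """Assumed interleaving: pairs go to axes round-robin (t, h, w, t, h, w, ...)."""
    return np.arange(num_pairs) % num_axes


def rotary_angles(positions, axis_ids, inv_freq):
    """positions: (seq, num_axes) integer coordinates -> (seq, num_pairs) angles.

    Each frequency pair is rotated by the coordinate of the axis it was assigned to.
    """
    return positions[:, axis_ids] * inv_freq[None, :]


def axis_frequency_span(axis_ids, inv_freq, axis):
    """(highest, lowest) inverse frequency available to one axis."""
    freqs = inv_freq[axis_ids == axis]
    return float(freqs.max()), float(freqs.min())


if __name__ == "__main__":
    inv_freq = rope_inv_freq(head_dim=48)            # 24 frequency pairs
    chunked = chunked_axis_ids(24, (8, 8, 8))
    interleaved = interleaved_axis_ids(24)
    for name, ids in [("chunked", chunked), ("interleaved", interleaved)]:
        spans = [axis_frequency_span(ids, inv_freq, a) for a in range(3)]
        print(name, spans)
```

Under these assumptions, a text token whose three coordinates are all equal to its 1D index receives exactly the vanilla RoPE angles under either allocation, which is one way to read "preservation of textual priors". The two allocations differ only for visual tokens: with chunking each axis sees one band of the spectrum, while with interleaving every axis sees both high and low frequencies.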

Code & Models

Repositories

JJJYmmm/Multimodal-RoPEs (GitHub)

Videos

Revisiting Multimodal Positional Encoding in Vision–Language Models (SlidesLive)