OMEGA: Optimized Multimodal Position Encoding Index Derivation with Global Adaptive Scaling for Vision-Language Models

Ruoxiang Huang; Xindian Ma; Rundong Kong; Zhen Yuan; Peng Zhang

arXiv:2511.00821·cs.CV·November 4, 2025

TL;DR

OMEGA introduces a modality-specific position encoding framework with adaptive scaling. By better preserving the structural properties of visual and textual data, it improves vision-language model performance by up to 3.43% on VQA benchmarks.

Contribution

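The paper proposes OMEGA, a position encoding framework for vision-language models with two parts: Modality-Specific Position Encoding (MSPE), which assigns positional indices along separate coordinate dimensions so that each modality keeps its inherent structure, and an adaptive scaling step that aligns the information density of textual and visual tokens in the positional index space.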

Findings

01. Up to 3.43% performance improvement on VQA benchmarks.

02. Consistent gains across multiple architectures and model sizes.

03. Effective preservation of modality-specific structural information.

Abstract

Vision-Language Models (VLMs) have demonstrated strong performance across various multimodal tasks, where position encoding plays a vital role in modeling both the sequential structure of textual information and the spatial structure of visual information. However, current VLMs commonly adopt modality-unified 1D or 2D positional indexing strategies, which treat textual and visual tokens uniformly without accounting for their distinct structural properties: sequential continuity for text and spatial coherence for vision. To address this limitation, we propose OMEGA, a novel position encoding framework that employs Modality-Specific Position Encoding (MSPE) to assign positional indices while preserving the inherent structures of each modality across separate coordinate dimensions. Additionally, to align the information density of multimodal data in the positional index space, OMEGA…
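
To make the contrast in the abstract concrete, the sketch below compares modality-unified 1D indexing with one hypothetical way of giving text and image tokens indices along separate coordinate dimensions. It is an illustration under stated assumptions, not the paper's MSPE derivation: the `(seq, row, col)` layout, the function names, and the choice to let an image occupy a single step on the sequential axis are all invented here, and the adaptive scaling step is omitted because the abstract is truncated before describing it.

```python
from typing import List, Tuple

# A segment is ("text", n_tokens) or ("image", height, width).
# This representation is an assumption made for the example.
Segment = Tuple


def unified_1d_indices(segments: List[Segment]) -> List[int]:
    """Modality-unified 1D indexing: every token, text or visual,
    simply receives the next integer in a single running counter."""
    total = 0
    for seg in segments:
        if seg[0] == "text":
            total += seg[1]
        else:
            total += seg[1] * seg[2]  # image patches flattened row by row
    return list(range(total))


def modality_specific_indices(segments: List[Segment]) -> List[Tuple[int, int, int]]:
    """Hypothetical modality-specific indexing: each token gets (seq, row, col).

    Text tokens advance only along the sequential axis.
    Image patches share one sequential step and are told apart by their
    (row, col) grid coordinates, so spatial neighbours stay close.
    """
    indices = []
    seq = 0
    for seg in segments:
        if seg[0] == "text":
            for _ in range(seg[1]):
                indices.append((seq, 0, 0))
                seq += 1
        else:
            _, height, width = seg
            for row in range(height):
                for col in range(width):
                    indices.append((seq, row, col))
            seq += 1  # assumption: the whole image costs one sequential step
    return indices


if __name__ == "__main__":
    example = [("text", 2), ("image", 2, 2), ("text", 1)]
    print(unified_1d_indices(example))
    print(modality_specific_indices(example))
```

In the unified scheme, two vertically adjacent patches of a wide image end up a full row width apart in index space, whereas in the sketched scheme they differ by one along the row axis. Whether OMEGA handles the sequential axis for images this way is not stated in the available text.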

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques