OMEGA: Optimized Multimodal Position Encoding Index Derivation with Global Adaptive Scaling for Vision-Language Models
Ruoxiang Huang, Xindian Ma, Rundong Kong, Zhen Yuan, Peng Zhang

TL;DR
OMEGA introduces a modality-specific position encoding framework with adaptive scaling, significantly improving vision-language model performance by better preserving structural properties of visual and textual data.
Contribution
The paper proposes OMEGA, a novel position encoding method that employs modality-specific encoding and adaptive scaling to enhance multimodal model effectiveness.
Findings
Up to 3.43% performance improvement on VQA benchmarks.
Consistent gains across multiple architectures and model sizes.
Effective preservation of modality-specific structural information.
Abstract
Vision-Language Models (VLMs) have demonstrated strong performance across various multimodal tasks, where position encoding plays a vital role in modeling both the sequential structure of textual information and the spatial structure of visual information. However, current VLMs commonly adopt modality-unified 1D or 2D positional indexing strategies, which treat textual and visual tokens uniformly without accounting for their distinct structural properties: sequential continuity for text and spatial coherence for vision. To address this limitation, we propose OMEGA, a novel position encoding framework that employs Modality-Specific Position Encoding (MSPE) to assign positional indices while preserving the inherent structures of each modality across separate coordinate dimensions. Additionally, to align the information density of multimodal data in the positional index space, OMEGA…
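As a rough illustration of what "separate coordinate dimensions" could mean in practice, the sketch below assigns each token a three-part index (t, row, col). This is not the paper's MSPE algorithm. The function name `assign_position_indices`, the segment format, and the specific layout (text advances along t, while an image shares one t value and spreads over row and col) are all assumptions made for illustration. The scaling step is not sketched, since the abstract above is cut off before describing it.

```python
from typing import List, Tuple

# Hypothetical segment format (an assumption, not from the paper):
#   ("text", n_tokens)          -> a run of n_tokens text tokens
#   ("image", height, width)    -> a height x width grid of visual tokens
Index = Tuple[int, int, int]  # (t, row, col)


def assign_position_indices(segments: List[tuple]) -> List[Index]:
    """Return one (t, row, col) index per token, in sequence order.

    Illustrative layout only: text keeps its sequential order on the
    t axis, and visual tokens keep their grid layout on (row, col).
    """
    indices: List[Index] = []
    t = 0  # running position on the sequential (text) axis
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            n_tokens = seg[1]
            for _ in range(n_tokens):
                # Text tokens differ only along t; spatial axes stay 0.
                indices.append((t, 0, 0))
                t += 1
        elif kind == "image":
            _, height, width = seg
            for row in range(height):
                for col in range(width):
                    # All tokens of one image share a single t value and
                    # are told apart by their grid coordinates.
                    indices.append((t, row, col))
            t += 1  # the whole image occupies one step on the t axis
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return indices


if __name__ == "__main__":
    example = [("text", 2), ("image", 2, 2), ("text", 1)]
    for idx in assign_position_indices(example):
        print(idx)
```

In this toy layout, a 2x2 image placed between two text runs consumes one step on the t axis instead of four, so neighboring visual tokens stay adjacent in (row, col) while the text on either side stays close in t. A modality-unified 1D scheme would instead number all seven tokens 0 through 6.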
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
