MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Ruoxiang Huang; Zhen Yuan

arXiv:2604.12537·cs.CV·April 15, 2026

TL;DR

MODIX introduces a training-free, adaptive positional encoding method for vision-language models that reallocates attention to more informative modalities, enhancing multimodal reasoning without altering model architecture.

Contribution

The paper proposes a training-free framework that dynamically adjusts positional indices according to information density within and across modalities, improving attention allocation in multimodal models.

Findings

01 MODIX improves performance across diverse benchmarks.

02 It reallocates attention to more informative modalities.

03 It works without modifying existing model parameters or architecture.

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to…
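
The pipeline described in the abstract (a covariance-based entropy term, a cross-modal alignment term, a unified score, and rescaled positional indices) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract is truncated here and gives no formulas, so every function name, the log-determinant entropy proxy, the max-cosine alignment measure, the softmax combination, the weight `alpha`, and the choice to compress the lower-scoring modality to a stride below 1 are assumptions made for illustration only.

```python
import numpy as np


def covariance_entropy(tokens: np.ndarray, eps: float = 1e-3) -> float:
    """Assumed density proxy: per-dimension Gaussian entropy, 0.5 * logdet(cov + eps*I) / d."""
    n, d = tokens.shape
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(n - 1, 1)
    _, logdet = np.linalg.slogdet(cov + eps * np.eye(d))
    return float(0.5 * logdet / d)


def alignment(src: np.ndarray, tgt: np.ndarray, eps: float = 1e-8) -> float:
    """Assumed interaction proxy: mean over src tokens of the best cosine match in tgt."""
    src_n = src / (np.linalg.norm(src, axis=1, keepdims=True) + eps)
    tgt_n = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + eps)
    return float((src_n @ tgt_n.T).max(axis=1).mean())


def modality_strides(visual: np.ndarray, text: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Combine both terms into one score per modality and map scores to strides.

    Assumption: the highest-scoring modality keeps stride 1 and the other is
    compressed to a stride below 1. The opposite convention is an easy swap.
    """
    raw = np.array([
        covariance_entropy(visual) + alpha * alignment(visual, text),
        covariance_entropy(text) + alpha * alignment(text, visual),
    ])
    scores = np.exp(raw - raw.max())  # numerically stable softmax
    scores /= scores.sum()
    return scores / scores.max()


def scaled_positions(per_token_strides) -> np.ndarray:
    """Position of token i is the sum of the strides of all tokens before it."""
    strides = np.asarray(per_token_strides, dtype=float)
    return np.cumsum(strides) - strides


# Toy input: 16 near-duplicate visual tokens followed by 32 diverse text tokens.
rng = np.random.default_rng(0)
visual = rng.normal(size=(1, 4)) + 1e-3 * rng.normal(size=(16, 4))
text = rng.normal(size=(32, 4))

v_stride, t_stride = modality_strides(visual, text)
positions = scaled_positions([v_stride] * len(visual) + [t_stride] * len(text))
```

In this toy setup the redundant visual tokens receive a fractional stride and therefore occupy a shorter span of the position axis than the text tokens. Feeding such real-valued positions to a model assumes a rotary-style positional encoding that accepts non-integer indices; no model weights are changed, which matches the training-free framing above.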
