MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
Ruoxiang Huang, Zhen Yuan

TL;DR
MODIX introduces a training-free, adaptive positional encoding method for vision-language models that reallocates attention to more informative modalities, enhancing multimodal reasoning without altering model architecture.
Contribution
The paper proposes a training-free framework that dynamically adjusts positional indices according to each modality's information density, improving attention allocation in multimodal models.
Findings
MODIX improves performance across diverse benchmarks.
It reallocates attention to more informative modalities.
It works without modifying existing model parameters or architecture.
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to…
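The pipeline the abstract describes (an intra-modal density score, an inter-modal alignment score, and a rescaling of positional indices) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Every function name here is hypothetical, and the specific choices (a Gaussian log-determinant as the covariance-based entropy, max cosine similarity as the alignment term, a softmax to combine scores, and a stride inversely proportional to the score) are assumptions made for illustration, since the visible abstract does not specify them.

```python
import numpy as np

def covariance_entropy(tokens: np.ndarray, eps: float = 1e-3) -> float:
    """Assumed density proxy: per-dimension Gaussian entropy term 0.5 * logdet(cov)."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(tokens.shape[0] - 1, 1)
    dim = cov.shape[0]
    # eps * I keeps the determinant positive when cov is rank-deficient.
    _, logdet = np.linalg.slogdet(cov + eps * np.eye(dim))
    return 0.5 * logdet / dim

def alignment(src: np.ndarray, tgt: np.ndarray) -> float:
    """Assumed alignment proxy: mean over src tokens of the best cosine match in tgt."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return float((src_n @ tgt_n.T).max(axis=1).mean())

def modality_strides(vision: np.ndarray, text: np.ndarray) -> np.ndarray:
    """Combine both proxies into one score per modality and map scores to strides."""
    scores = np.array([
        covariance_entropy(vision) + alignment(vision, text),
        covariance_entropy(text) + alignment(text, vision),
    ])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Assumption: a higher score yields a smaller stride; equal scores give stride 1.
    return 1.0 / (weights * len(weights))

def scaled_positions(segment_lengths, strides) -> np.ndarray:
    """Build positional indices where each segment advances by its own stride."""
    positions, cursor = [], 0.0
    for length, stride in zip(segment_lengths, strides):
        positions.append(cursor + stride * np.arange(length))
        cursor += stride * length
    return np.concatenate(positions)
```

With uniform strides of 1 the function reproduces the standard indices 0, 1, 2, ..., so the sketch leaves the baseline unchanged when both modalities score equally. The resulting indices are generally fractional, which assumes a positional encoding that can be evaluated at real-valued positions (rotary embeddings are one example).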
