TL;DR
This paper introduces Vision-PE Shuffle Guidance (VPSG), a training-free, inference-time correction method that mitigates the directional coordinate-prediction bias caused by visual positional encoding failures in multimodal large language models, improving localization accuracy.
Contribution
The paper presents VPSG, a novel inference-time correction technique that addresses positional encoding failures without retraining, enhancing coordinate prediction in vision-language models.
Findings
VPSG corrects coordinate drift on the ScreenSpot-Pro benchmark.
VPSG improves localization accuracy across various model scales.
The method does not require retraining or fine-tuning.
Abstract
While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, precise coordinate prediction remains a significant challenge, particularly as high-resolution inputs cause visual positional encodings (VPEs) to degrade. We demonstrate that these encoding failures do not result in random noise but instead trigger predictable, directional biases, suggesting that models default to internal spatial priors when grounding signals are weak. To counteract this, we introduce Vision-PE Shuffle Guidance (VPSG), a training-free, inference-time correction method. VPSG isolates position-unconditioned tendencies by shuffling VPEs and utilizes this negative evidence to steer digit decoding through a lightweight finite-state machine. Evaluation on the ScreenSpot-Pro benchmark confirms that VPSG effectively rectifies coordinate drift, yielding consistent improvements in localization…
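The decoding idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the guidance formula or the state machine's layout, so everything here is an assumption made for illustration: the contrastive form `lp + alpha * (lp - neg)`, the `alpha` weight, the `(x,y)` output format, and the names `CoordFSM`, `guided_scores`, and `pick` are hypothetical and are not taken from the paper. The logits are toy values standing in for two forward passes, one with the normal VPEs and one with shuffled VPEs.

```python
import math

DIGITS = set("0123456789")

def log_softmax(logits):
    """Normalize a {token: logit} dict into log-probabilities."""
    peak = max(logits.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}

class CoordFSM:
    """Hypothetical state machine for outputs shaped like "(x,y)"."""
    TRANSITIONS = {("text", "("): "x", ("x", ","): "y", ("y", ")"): "text"}

    def __init__(self):
        self.state = "text"

    @property
    def in_number(self):
        # Guidance is only applied while a coordinate is being written.
        return self.state in ("x", "y")

    def advance(self, token):
        # Unlisted (state, token) pairs, such as digits, keep the current state.
        self.state = self.TRANSITIONS.get((self.state, token), self.state)

def guided_scores(cond_logits, shuffled_logits, fsm, alpha=1.0):
    """Push digit tokens away from the shuffled-VPE (position-unconditioned) prior."""
    cond = log_softmax(cond_logits)
    if not fsm.in_number:
        return cond
    neg = log_softmax(shuffled_logits)
    # Assumed contrastive form: boost digits the shuffled pass does NOT favor.
    return {tok: lp + alpha * (lp - neg[tok]) if tok in DIGITS else lp
            for tok, lp in cond.items()}

def pick(scores):
    """Greedy choice of the highest-scoring token."""
    return max(scores, key=scores.get)

# Toy step: the model has emitted "(" and must now choose the first digit of x.
fsm = CoordFSM()
fsm.advance("(")
cond = {"1": 2.0, "5": 2.1, ",": 0.0}      # logits with normal VPEs
shuffled = {"1": 0.0, "5": 3.0, ",": 0.0}  # logits with shuffled VPEs
unguided = pick(log_softmax(cond))
guided = pick(guided_scores(cond, shuffled, fsm))
```

In this toy step the shuffled pass strongly prefers "5", which marks it as a prior-driven choice, so the guided score shifts the decision to "1" even though the unguided argmax is "5". Restricting the correction to digit tokens inside a number leaves separators and surrounding text on the ordinary conditional distribution.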
