A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images
Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, Seonjoo Kim

TL;DR
This paper introduces Extract Candidate then Predict (ECP), a training-free, task-agnostic framework that improves MLLM performance on high-resolution images. It first identifies candidate regions and then refines its prediction within them, which addresses the mismatch between training and test image resolution.
Contribution
ECP is a novel two-stage method that enhances MLLM accuracy on high-res images without additional training, leveraging implicit localization cues from coarse predictions.
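The two-stage idea can be illustrated with a small coordinate-mapping sketch. It is not the authors' implementation: this summary does not say how candidates are extracted, so the sketch assumes the coarse prediction is a single point on a downsampled copy of the image, and that the candidate region is a fixed-size crop of the full-resolution image centered on that point. The names `candidate_box`, `two_stage_predict`, `coarse_predict`, `refine_predict`, and `crop_size` are all hypothetical; the two callables stand in for MLLM calls.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in full-resolution pixels


def candidate_box(coarse_xy: Point, scale: float,
                  full_size: Tuple[int, int], crop_size: int) -> Box:
    """Map a coarse point (downsampled coordinates) to a crop window
    in the full-resolution image, clamped to the image bounds."""
    full_w, full_h = full_size
    # Scale the coarse point back up to full-resolution coordinates.
    cx, cy = coarse_xy[0] * scale, coarse_xy[1] * scale
    w, h = min(crop_size, full_w), min(crop_size, full_h)
    # Center the window on the point, then clamp so it stays inside the image.
    left = int(min(max(cx - w / 2, 0), full_w - w))
    top = int(min(max(cy - h / 2, 0), full_h - h))
    return (left, top, left + w, top + h)


def two_stage_predict(full_size: Tuple[int, int], model_size: int,
                      coarse_predict: Callable[[float], Point],
                      refine_predict: Callable[[Box], Point],
                      crop_size: int) -> Point:
    """Stage 1: coarse prediction on a downsampled image (assumed to be a point).
    Stage 2: prediction on the full-resolution crop around that point."""
    # Downsampling factor that fits the longer side into the model's input size.
    scale = max(full_size) / model_size
    coarse_xy = coarse_predict(scale)            # hypothetical MLLM call, stage 1
    box = candidate_box(coarse_xy, scale, full_size, crop_size)
    local_x, local_y = refine_predict(box)       # hypothetical MLLM call, stage 2
    # Convert the crop-local answer back to full-resolution coordinates.
    return (box[0] + local_x, box[1] + local_y)
```

In this sketch the second call sees the candidate region at native resolution, so fine detail lost in the downsampled pass is available again. How the real framework derives candidates from coarse predictions may differ from the single-point assumption used here.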
Findings
Achieved +21.3% improvement on 4K GUI grounding
Gained +5.8% on 4K MLLM perception
Secured +5.2% on 8K MLLM perception
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned with a fixed image resolution to align with the pre-trained image encoder used in the MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although it ensures consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition…
