A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images

Jaeseong Lee; Yeeun Choi; Heechan Choi; Hanjung Kim; Seonjoo Kim

arXiv:2507.10202·cs.CV·July 15, 2025

TL;DR

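This paper introduces ECP (Extract Candidate then Predict), a training-free, task-agnostic framework that improves MLLM performance on high-resolution images. It first identifies candidate regions and then refines the prediction, addressing the mismatch between the fixed resolution used in fine-tuning and the resolution of the input image.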

Contribution

ECP is a novel two-stage method that improves MLLM accuracy on high-resolution images without additional training by using the implicit localization cues contained in coarse predictions.
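The two-stage idea can be illustrated with a minimal sketch for a point-grounding task. This is one possible reading, not the authors' implementation: the summary here does not say how candidate regions are extracted, so the fixed-size crop centred on the coarse prediction is an assumption. The names `candidate_box`, `to_full_image`, `two_stage_predict`, and the `predict` callable (a stand-in for an MLLM call) are all hypothetical.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]       # (x, y)
Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in full-image pixels


def candidate_box(coarse_xy_norm: Point, image_size: Tuple[int, int],
                  crop_size: Tuple[int, int]) -> Box:
    """Centre a crop window on a coarse, normalized prediction.

    ASSUMPTION: a single fixed-size crop is used as the candidate region.
    The window is clamped so it stays inside the image.
    """
    width, height = image_size
    crop_w, crop_h = min(crop_size[0], width), min(crop_size[1], height)
    cx, cy = coarse_xy_norm[0] * width, coarse_xy_norm[1] * height
    left = int(min(max(cx - crop_w / 2, 0), width - crop_w))
    top = int(min(max(cy - crop_h / 2, 0), height - crop_h))
    return (left, top, left + crop_w, top + crop_h)


def to_full_image(local_xy_norm: Point, box: Box) -> Point:
    """Map a normalized point inside the crop back to full-image pixels."""
    left, top, right, bottom = box
    return (left + local_xy_norm[0] * (right - left),
            top + local_xy_norm[1] * (bottom - top))


def two_stage_predict(predict: Callable[[Optional[Box]], Point],
                      image_size: Tuple[int, int],
                      crop_size: Tuple[int, int]) -> Point:
    """Coarse pass on the whole image, then a refined pass on the crop.

    `predict(None)` is a hypothetical model call on the downsampled full
    image; `predict(box)` is the same model applied to the cropped region.
    Both return a point normalized to the view they were given.
    """
    coarse = predict(None)                       # stage 1: coarse localization cue
    box = candidate_box(coarse, image_size, crop_size)
    fine = predict(box)                          # stage 2: predict on the candidate
    return to_full_image(fine, box)
```

In this sketch the model is never retrained; only its inputs change, so the second call sees the region at a scale closer to the resolution it was fine-tuned on. A real pipeline would replace `predict` with an actual MLLM call and might use several candidates rather than one.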

Findings

01 Achieved +21.3% improvement on 4K GUI grounding

02 Gained +5.8% on 4K MLLM perception

03 Secured +5.2% on 8K MLLM perception

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned at a fixed image resolution to align with the pre-trained image encoder used in the MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although it ensures consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition…
