FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

Liangyu Zhong; Fabio Rosenthal; Joachim Sicking; Fabian Hüger; Thorsten Bagdonat; Hanno Gottschalk; Leo Schwinn

arXiv:2506.21710 · cs.CV · October 30, 2025

TL;DR

FOCUS is a training-free visual cropping method that uses internal MLLM representations to efficiently identify relevant image regions, significantly improving fine-grained VQA accuracy and efficiency without task-specific fine-tuning.

Contribution

The paper introduces a training-free visual cropping approach that uses MLLM-internal features to improve fine-grained VQA accuracy and efficiency.

Findings

01. Outperforms existing visual cropping methods in accuracy and efficiency.

02. Achieves comparable results to the best baseline with 3-6.5x less compute.

03. Effective across multiple datasets and MLLM types.

Abstract

While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) that focuses on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and…
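
The second and third steps (relevance map, then region proposal and ranking) can be illustrated with a toy sketch. This is not the authors' implementation: the cosine-similarity scoring, the sliding-window proposals, the mean-score ranking, and the names `relevance_map` and `rank_regions` are all assumptions chosen for illustration, and the feature vectors are made-up stand-ins for what a real MLLM would cache.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance_map(target_vec, patch_vecs):
    """Score every image patch against the target-object vector.

    patch_vecs is a rows x cols grid of feature vectors (hypothetical
    stand-ins for cached per-patch features); the result is a rows x cols
    grid of similarity scores.
    """
    return [[cosine(target_vec, p) for p in row] for row in patch_vecs]


def rank_regions(rel_map, win_h, win_w, top_k=3):
    """Propose all win_h x win_w windows and rank them by mean relevance.

    Returns up to top_k tuples (mean_score, (top, left, height, width)),
    best first. Sliding windows are an assumption of this sketch.
    """
    rows, cols = len(rel_map), len(rel_map[0])
    scored = []
    for top in range(rows - win_h + 1):
        for left in range(cols - win_w + 1):
            total = sum(
                rel_map[r][c]
                for r in range(top, top + win_h)
                for c in range(left, left + win_w)
            )
            scored.append((total / (win_h * win_w), (top, left, win_h, win_w)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy 3 x 4 patch grid: background patches are orthogonal to the target,
# two patches match it exactly, and one matches it partially.
background, match, partial = [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]
grid = [
    [background, background, background, background],
    [background, background, match, partial],
    [background, background, match, background],
]
rel = relevance_map([1.0, 0.0], grid)
best_score, best_region = rank_regions(rel, win_h=2, win_w=2, top_k=1)[0]
print(best_region, round(best_score, 3))
```

In this toy setup the top-ranked window is the one covering both exact-match patches and the partial match; in practice that region would be cropped from the image and passed back to the model at higher effective resolution.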

Videos

FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning