FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering
Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, Leo Schwinn

TL;DR
FOCUS is a training-free visual cropping method that uses internal MLLM representations to efficiently identify relevant image regions, significantly improving fine-grained VQA accuracy and efficiency without task-specific fine-tuning.
Contribution
The paper introduces a training-free visual cropping approach that leverages MLLM-internal features to improve both the accuracy and the efficiency of fine-grained VQA.
Findings
Outperforms existing visual cropping methods in accuracy and efficiency.
Achieves comparable results to the best baseline with 3-6.5x less compute.
Effective across multiple datasets and MLLM types.
Abstract
While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and…
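The third step named in the abstract (proposing and ranking image regions from a relevance map) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how regions are proposed or scored, so the sliding-window proposals, the mean-relevance score, and the names `rank_regions`, `crop_box_pixels`, and `patch_size` are all assumptions made for illustration. The relevance map is taken as a given 2-D array with one value per image patch; how FOCUS derives it from the KV cache is not shown here.

```python
import numpy as np


def rank_regions(relevance, window=(2, 2), top_k=3):
    """Rank sliding-window regions of a 2-D relevance map, best first.

    Assumption: a region's score is the mean relevance of the patches it
    covers. Returns a list of (score, (top, left, height, width)) tuples
    in patch coordinates.
    """
    relevance = np.asarray(relevance, dtype=float)
    rows, cols = relevance.shape
    h, w = window
    if h > rows or w > cols:
        raise ValueError("window is larger than the relevance map")

    candidates = []
    for top in range(rows - h + 1):
        for left in range(cols - w + 1):
            score = float(relevance[top:top + h, left:left + w].mean())
            candidates.append((score, (top, left, h, w)))

    # Highest mean relevance first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top_k]


def crop_box_pixels(region, patch_size):
    """Convert a patch-coordinate region to a pixel box (x0, y0, x1, y1).

    patch_size is a hypothetical side length of one square patch in pixels.
    """
    top, left, h, w = region
    return (left * patch_size, top * patch_size,
            (left + w) * patch_size, (top + h) * patch_size)


# Toy 4x4 relevance map with one clearly relevant 2x2 block.
toy_map = np.zeros((4, 4))
toy_map[1:3, 2:4] = 1.0

best_score, best_region = rank_regions(toy_map, window=(2, 2), top_k=1)[0]
best_box = crop_box_pixels(best_region, patch_size=14)
```

On the toy map the top-ranked window coincides with the block of high relevance, and `crop_box_pixels` turns it into a pixel box that could be passed to an image-cropping routine. An exhaustive sliding window is used only for brevity; a guided search would restrict the candidates rather than score every position.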
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
