Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Woojun Jung; Jaehoon Go; Mingyu Jeon; Sunjae Yoon; Junyeong Kim

arXiv:2512.10362·cs.CV·April 28, 2026

TL;DR

Visual Funnel is a training-free method that improves fine-grained perception in multimodal large language models by preserving hierarchical visual context. It targets 'Contextual Blindness', a failure caused by the structural disconnect between cropped high-fidelity details and the global context of the original image.

Contribution

We introduce Visual Funnel, a novel two-step approach that dynamically constructs hierarchical visual context to improve fine-grained perception in multimodal models.

Findings

01. Visual Funnel outperforms naive single-crop baselines.

02. Adding unstructured crops offers limited benefits, highlighting the importance of hierarchical structure.

03. The method effectively resolves 'Contextual Blindness' in multimodal models.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to a structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It…
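
As a rough illustration of what "hierarchical visual context" could look like in practice, the sketch below builds a set of nested crop boxes around a region of interest, ending with the full image. It is not the authors' implementation: the function name `nested_crop_boxes`, the fixed `scales` values, and the assumption that the region of interest arrives as a pixel bounding box are all hypothetical, and the truncated abstract does not say how Visual Funnel actually chooses its crop sizes.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]


def nested_crop_boxes(
    roi: Box,
    image_size: Tuple[int, int],
    scales: Tuple[float, ...] = (1.0, 2.0, 4.0),  # hypothetical, fixed for illustration
) -> List[Box]:
    """Return boxes centered on `roi`, each enlarged by a scale factor and
    clipped to the image bounds, followed by the full-image box."""
    width, height = image_size
    left, top, right, bottom = roi
    cx, cy = (left + right) / 2, (top + bottom) / 2
    half_w, half_h = (right - left) / 2, (bottom - top) / 2

    boxes: List[Box] = []
    for s in scales:
        box = (
            max(0, round(cx - s * half_w)),
            max(0, round(cy - s * half_h)),
            min(width, round(cx + s * half_w)),
            min(height, round(cy + s * half_h)),
        )
        if box not in boxes:  # clipping can make two scales coincide
            boxes.append(box)

    full_image = (0, 0, width, height)
    if full_image not in boxes:
        boxes.append(full_image)
    return boxes


if __name__ == "__main__":
    # A 100x100 region of interest inside a 1000x800 image.
    for box in nested_crop_boxes((400, 300, 500, 400), (1000, 800)):
        print(box)
```

The boxes use the (left, top, right, bottom) convention, so each one can be passed directly to a cropping routine such as Pillow's `Image.crop`. The point of the sketch is only the structure: the same focal region seen at several levels of surrounding context, rather than a single crop placed next to the original image.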

Code & Models

Repositories

jungnerd/Visual-Funnel (GitHub)