Finding Needles in Images: Can Multimodal LLMs Locate Fine Details?

Parth Thakkar; Ankush Agarwal; Prasad Kasu; Pulkit Bansal; Chaitanya Devaguptapu

arXiv:2508.05053·cs.CV·August 8, 2025

TL;DR

This paper introduces NiM, a benchmark for evaluating multimodal large language models' ability to locate fine details in complex documents, and proposes Spot-IT, a method that improves detail extraction through intelligent patch selection and Gaussian attention.

Contribution

The paper presents NiM, a new benchmark for fine-grained document understanding, and Spot-IT, a novel approach that enhances MLLMs' precision in locating details.

Findings

01. Spot-IT significantly outperforms baseline methods.

02. MLLMs show limitations in fine-grained detail localization.

03. Spot-IT improves accuracy in complex document tasks.

Abstract

While Multi-modal Large Language Models (MLLMs) have shown impressive capabilities in document understanding tasks, their ability to locate and reason about fine-grained details within complex documents remains understudied. Consider searching a restaurant menu for a specific nutritional detail or identifying a disclaimer in a lengthy newspaper article: tasks that demand careful attention to small but significant details within a broader narrative, akin to Finding Needles in Images (NiM). To address this gap, we introduce NiM, a carefully curated benchmark spanning diverse real-world documents including newspapers, menus, and lecture images, specifically designed to evaluate MLLMs' capability in these intricate tasks. Building on this, we further propose Spot-IT, a simple yet effective approach that enhances MLLMs' capability through intelligent patch selection and Gaussian attention,…
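
The abstract names two ingredients, patch selection and Gaussian attention, without giving details on this page. The sketch below is one plausible reading, not the authors' implementation: it assumes the image is split into a grid of patches, that each patch already has a query-relevance score (the values here are made up), and that "Gaussian attention" means a 2D Gaussian weighting centered on the selected patch. The function names `select_patch`, `patch_center`, and `gaussian_mask` are hypothetical.

```python
import math


def select_patch(scores):
    """Return (row, col) of the highest-scoring patch in a 2D score grid."""
    cells = ((r, c) for r in range(len(scores)) for c in range(len(scores[0])))
    return max(cells, key=lambda rc: scores[rc[0]][rc[1]])


def patch_center(rc, patch_h, patch_w):
    """Pixel coordinates (y, x) of the center of patch (row, col)."""
    r, c = rc
    return (r * patch_h + (patch_h - 1) / 2, c * patch_w + (patch_w - 1) / 2)


def gaussian_mask(height, width, center, sigma):
    """Per-pixel weights in (0, 1] that decay with distance from `center`."""
    cy, cx = center
    return [
        [
            math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Assumed inputs: a 2x3 grid of 4x4-pixel patches with made-up relevance scores.
scores = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.2],
]
best = select_patch(scores)              # patch with the highest score
center = patch_center(best, 4, 4)        # its center in pixel coordinates
mask = gaussian_mask(8, 12, center, sigma=2.0)
```

Under these assumptions the mask could be multiplied into the image (or into attention weights) so that pixels near the selected patch dominate while the surrounding context is down-weighted rather than cropped away.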
