Finding Needles in Images: Can Multimodal LLMs Locate Fine Details?
Parth Thakkar, Ankush Agarwal, Prasad Kasu, Pulkit Bansal, Chaitanya Devaguptapu

TL;DR
This paper introduces NiM, a benchmark for evaluating multimodal large language models' ability to locate fine details in complex documents, and proposes Spot-IT, a method that improves detail extraction through intelligent patch selection and Gaussian attention.
Contribution
The paper presents NiM, a new benchmark for fine-grained document understanding, and Spot-IT, a novel approach that enhances MLLMs' precision in locating details.
Findings
Spot-IT significantly outperforms baseline methods.
MLLMs show limitations in fine-grained detail localization.
Spot-IT improves accuracy in complex document tasks.
Abstract
While Multi-modal Large Language Models (MLLMs) have shown impressive capabilities in document understanding tasks, their ability to locate and reason about fine-grained details within complex documents remains understudied. Consider searching a restaurant menu for a specific nutritional detail or identifying a disclaimer in a lengthy newspaper article. Such tasks demand careful attention to small but significant details within a broader narrative, akin to Finding Needles in Images (NiM). To address this gap, we introduce NiM, a carefully curated benchmark spanning diverse real-world documents including newspapers, menus, and lecture images, specifically designed to evaluate MLLMs' capability in these intricate tasks. Building on this, we further propose Spot-IT, a simple yet effective approach that enhances MLLMs' capability through intelligent patch selection and Gaussian attention,…
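The abstract names patch selection and Gaussian attention without giving details, so the following is only an illustrative sketch of how such a pipeline could look, not the authors' implementation. It assumes non-overlapping square patches, cosine similarity between a query embedding and per-patch embeddings as the selection rule, and an isotropic Gaussian mask centred on the chosen patch. The function names (`split_into_patches`, `select_patch`, `gaussian_mask`), the toy embeddings, and the `sigma` value are all hypothetical.

```python
import numpy as np


def split_into_patches(height, width, patch):
    """Return (row, col) top-left corners of non-overlapping square patches."""
    return [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]


def select_patch(patch_embs, query_emb):
    """Assumed selection rule: index of the patch most cosine-similar to the query."""
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return int(np.argmax(p @ q))


def gaussian_mask(height, width, center, sigma):
    """Isotropic Gaussian weights over the image, peaking at `center` (row, col)."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    dist_sq = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))


# Toy setup: a 64x64 "document" split into four 32x32 patches.
HEIGHT, WIDTH, PATCH = 64, 64, 32
corners = split_into_patches(HEIGHT, WIDTH, PATCH)

# Hypothetical embeddings; in practice these would come from an image/text encoder.
patch_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
query_emb = np.array([0.1, 0.9])

best = select_patch(patch_embs, query_emb)
top, left = corners[best]
center = (top + PATCH // 2, left + PATCH // 2)

# Soft emphasis map: close to 1 near the selected patch, decaying with distance.
mask = gaussian_mask(HEIGHT, WIDTH, center, sigma=16.0)
```

A soft Gaussian weighting, as opposed to a hard crop, keeps some surrounding context visible while emphasising the region most likely to contain the queried detail. Whether Spot-IT applies the weighting to pixels, to attention scores, or elsewhere is not stated in the text above.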
