Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, Hao Wang

TL;DR
This paper introduces the MMNeedle benchmark to evaluate the long-context capabilities of multimodal large language models, revealing performance gaps and hallucination issues, especially in negative retrieval scenarios.
Contribution
The paper presents the MMNeedle benchmark for assessing long-context multimodal models and provides a comprehensive evaluation of state-of-the-art models using this new benchmark.
Findings
GPT-4o outperforms other models in long-context tasks
Models exhibit hallucination problems in negative samples
Significant performance gap between API-based and open-source models
Abstract
Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address these gaps, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual…
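The haystack construction and automatic labeling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: sub-images are represented only by integer IDs rather than pixels, and the names `build_haystack` and `exact_match_accuracy`, the square grid layout, and the `(image index, row, column)` label format are all assumptions made for illustration. A negative sample is modeled as a haystack that does not contain the needle, with the label `None`.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical label format: (image index, row, column), or None if the needle is absent.
Label = Optional[Tuple[int, int, int]]


def build_haystack(
    sub_image_ids: Sequence[int],
    needle_id: int,
    num_images: int,
    grid_size: int,
    include_needle: bool,
    rng: random.Random,
) -> Tuple[List[List[List[int]]], Label]:
    """Arrange sub-image IDs into `num_images` stitched grids of grid_size x grid_size.

    Returns the haystack as nested lists indexed [image][row][column] and the
    automatically generated label for the needle position.
    """
    cells_per_image = grid_size * grid_size
    total_cells = num_images * cells_per_image

    # Distractors are all candidate sub-images except the needle itself.
    distractors = [i for i in sub_image_ids if i != needle_id]
    if len(distractors) < total_cells:
        raise ValueError("not enough distinct sub-images to fill the haystack")

    flat = rng.sample(distractors, total_cells)

    label: Label = None
    if include_needle:
        # Overwrite one random cell with the needle and record where it went.
        position = rng.randrange(total_cells)
        flat[position] = needle_id
        image_index, offset = divmod(position, cells_per_image)
        row, column = divmod(offset, grid_size)
        label = (image_index, row, column)

    # Reshape the flat list into [image][row][column].
    haystack = [
        [
            flat[(m * grid_size + r) * grid_size:(m * grid_size + r + 1) * grid_size]
            for r in range(grid_size)
        ]
        for m in range(num_images)
    ]
    return haystack, label


def exact_match_accuracy(predictions: Sequence[Label], labels: Sequence[Label]) -> float:
    """Fraction of samples where the predicted location equals the label.

    For negative samples the only correct prediction is None, so a model that
    reports a location anyway (a hallucinated needle) is counted as wrong.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)
```

In this sketch the label is a by-product of placing the needle, so no manual annotation is needed. Scoring negative samples against `None` is one simple way to surface the hallucination behavior noted in the findings.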
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: Sparse Evolutionary Training
