Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Hengyi Wang; Haizhou Shi; Shiwei Tan; Weiyi Qin; Wenyuan Wang; Tunyu Zhang; Akshay Nambi; Tanuja Ganu; Hao Wang

arXiv:2406.11230 · cs.LG · February 12, 2025 · 1 citation

TL;DR

This paper introduces the MMNeedle benchmark to evaluate the long-context capabilities of multimodal large language models, revealing performance gaps and hallucination issues, especially in negative retrieval scenarios.

Contribution

The paper presents the MMNeedle benchmark for assessing long-context multimodal models and provides a comprehensive evaluation of state-of-the-art models using this new benchmark.

Findings

01 GPT-4o outperforms other models in long-context tasks

02 Models exhibit hallucination problems in negative samples

03 Significant performance gap between API-based and open-source models

Abstract

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, their long-context capabilities have not yet been comprehensively evaluated. To address this gap, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual…
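The stitching and labeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`stitch`, `build_sample`), the row-major grid layout, the `(image_index, row, col)` label format, and the use of `None` as the label for a negative sample are all assumptions made for illustration. Images are plain nested lists of pixel values so the example needs only the standard library.

```python
import random


def stitch(tiles, n):
    """Stitch n*n equally sized tiles (2D lists of pixels) into one image, row-major."""
    assert len(tiles) == n * n
    height = len(tiles[0])
    stitched = []
    for grid_row in range(n):
        row_tiles = tiles[grid_row * n:(grid_row + 1) * n]
        for y in range(height):
            line = []
            for tile in row_tiles:
                line.extend(tile[y])  # place tiles of this grid row side by side
            stitched.append(line)
    return stitched


def build_sample(distractors, needle, m, n, positive, rng):
    """Build a haystack of m stitched images, each an n x n grid of sub-images.

    Returns (haystack, label). For a positive sample the label is the
    hypothetical triple (image_index, row, col) locating the needle; for a
    negative sample the needle is absent and the label is None.
    """
    cells = m * n * n
    tiles = rng.sample(distractors, cells)  # distinct distractor sub-images
    label = None
    if positive:
        pos = rng.randrange(cells)
        tiles[pos] = needle
        image_index, cell = divmod(pos, n * n)
        row, col = divmod(cell, n)
        label = (image_index, row, col)  # generated automatically, no annotation
    haystack = [stitch(tiles[i * n * n:(i + 1) * n * n], n) for i in range(m)]
    return haystack, label


if __name__ == "__main__":
    # Toy 2x2-pixel tiles; the needle is marked with pixel value -1.
    distractors = [[[i, i], [i, i]] for i in range(20)]
    needle = [[-1, -1], [-1, -1]]
    haystack, label = build_sample(distractors, needle, 2, 2, True, random.Random(0))
    print(len(haystack), "stitched images, needle at", label)
```

Because the label falls out of the placement step, positive and negative samples can be generated at any haystack size without manual annotation. A model's answer would then be compared against this label, with `None` as the expected answer when the needle is absent.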

Code & Models

Repositories

wang-ml-lab/multimodal-needle-in-a-haystack (official)

Datasets

Wang-ML-Lab/MMNeedle
dataset · 65 downloads

Videos

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling

Methods: Sparse Evolutionary Training