MileBench: Benchmarking MLLMs in Long Context

Dingjie Song; Shunian Chen; Guiming Hardy Chen; Fei Yu; Xiang Wan; Benyou Wang

arXiv:2404.18532 · cs.CL · May 16, 2024 · 1 citation


Open Access · 1 dataset

TL;DR

MileBench is a new benchmark designed to evaluate Multimodal Large Language Models' ability to handle long contexts and multiple images across various tasks, revealing current limitations especially in open-source models.

Contribution

This paper introduces MileBench, the first comprehensive benchmark for testing MLLMs on long-context, multi-image tasks, filling a gap in existing evaluation methods.

Findings

01. GPT-4o outperforms other models in long-context tasks.

02. Most open-source MLLMs struggle with long contexts, especially with multiple images.

03. Performance gaps increase as the number of images grows.

Abstract

Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity…


Code & Models

Datasets

FreedomIntelligence/MileBench
dataset · 746 downloads


Taxonomy

Topics: Open Education and E-Learning

Methods: Focus