MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Peng Xia; Siwei Han; Shi Qiu; Yiyang Zhou; Zhaoyang Wang; Wenhao Zheng; Zhaorun Chen; Chenhang Cui; Mingyu Ding; Linjie Li; Lijuan Wang; Huaxiu Yao

arXiv:2410.10139 · cs.CV · April 1, 2025

TL;DR

MMIE is a comprehensive, large-scale benchmark designed to evaluate large vision-language models' ability to understand and generate interleaved multimodal content across diverse fields, with a new automated evaluation metric.

Contribution

The paper introduces MMIE, a large-scale, knowledge-intensive benchmark with a reliable automated evaluation metric for interleaved multimodal comprehension and generation in LVLMs.

Findings

01. Existing models show significant room for improvement.
02. MMIE effectively evaluates diverse competencies in LVLMs.
03. The proposed metric reduces bias and improves evaluation reliability.

Abstract

Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs…

Code & Models

Repositories

Lillianwei-h/MMIE
PyTorch · Official

Models

🤗 MMIE/MMIE-Score
model · 4 dl · ♡ 1

Datasets

MMIE/MMIE
dataset · 83 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques