UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Yanlin Li; Minghui Guo; Kaiwen Zhang; Shize Zhang; Yiran Zhao; Haodong Li; Congyue Zhou; Weijie Zheng; Yushen Yan; Shengqiong Wu; Wei Ji; Lei Cui; Furu Wei; Hao Fei; Mong-Li Lee; Wynne Hsu

arXiv:2603.05075·cs.CV·March 6, 2026

Open Access

TL;DR

The paper introduces UniM, a comprehensive benchmark and evaluation suite for unified any-to-any interleaved multimodal understanding and generation, along with a baseline model to advance multimodal large language models.

Contribution

The paper presents the first unified benchmark dataset and evaluation framework for interleaved multimodal tasks, and proposes a baseline model with traceable reasoning capabilities.

Findings

01. UniM is highly challenging for current models.

02. The benchmark reveals key challenges in multimodal reasoning.

03. The baseline model demonstrates the potential for structured interleaved generation.

Abstract

In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions:…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis