MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

Mingrui Wu; Hang Liu; Jiayi Ji; Xiaoshuai Sun; Rongrong Ji

arXiv:2602.19497·cs.CV·February 24, 2026

TL;DR

MICON-Bench is a six-task benchmark for multi-image context generation in unified multimodal models. It is accompanied by an MLLM-driven Evaluation-by-Checkpoint framework and Dynamic Attention Rebalancing (DAR), a training-free attention mechanism intended to improve coherence and reduce hallucinations.

Contribution

This work presents MICON-Bench, a new benchmark for multi-image reasoning, an MLLM-driven evaluation framework, and a plug-and-play attention rebalancing method for enhanced image generation.

Findings

01

MICON-Bench effectively exposes multi-image reasoning challenges.

02

Dynamic Attention Rebalancing (DAR) improves coherence and reduces hallucinations in image generation.

03

The evaluation framework automates semantic and visual consistency verification.

Abstract

Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, in which a multimodal large language model (MLLM) serves as the verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that…
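
The abstract describes Evaluation-by-Checkpoint only at a high level, so the following is a minimal sketch of one way such a scheme could be organized, not the paper's implementation. It assumes each generated image is judged against a list of yes/no checkpoints and that the score is the fraction of checkpoints passed. The names Checkpoint, Verifier, checkpoint_score, and stub_verifier are hypothetical, and the stub stands in for a real MLLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    """One yes/no question about the generated image (hypothetical structure)."""
    question: str


# Assumed verifier signature: (image_path, question) -> True if the checkpoint passes.
# In practice this would wrap a call to an MLLM; no real API is assumed here.
Verifier = Callable[[str, str], bool]


def checkpoint_score(image_path: str,
                     checkpoints: List[Checkpoint],
                     verifier: Verifier) -> float:
    """Return the fraction of checkpoints the verifier marks as passed."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    passed = sum(1 for cp in checkpoints if verifier(image_path, cp.question))
    return passed / len(checkpoints)


def stub_verifier(image_path: str, question: str) -> bool:
    """Stand-in for an MLLM: fails any question that mentions 'background'."""
    return "background" not in question.lower()


if __name__ == "__main__":
    checkpoints = [
        Checkpoint("Is the subject's identity preserved from the first reference image?"),
        Checkpoint("Is the background consistent with the second reference image?"),
    ]
    print(checkpoint_score("generated.png", checkpoints, stub_verifier))
```

Averaging binary checkpoint outcomes is an assumption made here for simplicity; the paper may weight or aggregate checkpoints differently.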

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning