Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

Qingyang Liu; Bingjie Gao; Canmiao Fu; Zhipeng Huang; Chen Li; Feng Wang; Shuochen Chang; Shaobo Wang; Yali Wang; Keming Ye; Jiangtong Li; and Li Niu

arXiv:2605.14709·cs.CV·May 15, 2026

TL;DR

This paper introduces a self-adaptive framework for unified multimodal models that dynamically switches between generation strategies to improve pixel-level manipulation and reasoning in anything-to-image tasks.

Contribution

It proposes a hierarchical data pipeline and a two-stage training strategy with adaptive modes, enhancing model flexibility and performance on complex visual reasoning tasks.

Findings

01 Outperforms existing baselines on X2I tasks.

02 Achieves higher fidelity in simple-to-complex instruction generation.

03 Demonstrates effective self-adaptive switching between generation modes.

Abstract

Recent unified models integrate multimodal understanding and generation within a single framework. However, an "understanding-generation gap" persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in anything-to-image (X2I) tasks: the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we build a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases,…

Code & Models

Repositories

WeChatCV/Interleaved_Visual_Reasoner (GitHub)
