How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

Haoyu Chen; Qing Liu; Yuqian Zhou; He Zhang; Zhaowen Wang; Mengwei Ren; Jingjing Ren; Xiang Wang; Zhe Lin; Lei Zhu

arXiv:2603.07540·cs.CV·March 10, 2026

Open Access

TL;DR

This paper investigates the reliability issues in long-horizon multimodal image generation and introduces UniLongGen, a memory curation method that enhances stability and quality by actively forgetting interfering visual signals during inference.

Contribution

The paper identifies visual history pollution as a key problem and proposes UniLongGen, a novel, training-free inference strategy that dynamically curates memory to improve long-term generation fidelity.

Findings

01. UniLongGen improves long-horizon fidelity and consistency.

02. It reduces memory footprint and inference time.

03. Active forgetting is crucial for stable multimodal generation.

Abstract

Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of…
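
To make the idea of curating visual history concrete, the following is a minimal Python sketch of one possible "active forgetting" policy: keep all text segments but retain only the most recent few image segments, with the budget counted in image events rather than tokens. This is an illustration under stated assumptions, not the UniLongGen algorithm; the excerpt above does not specify how the paper selects what to keep. The names `Segment`, `curate_context`, and `max_images` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One chunk of interleaved history (hypothetical structure)."""
    kind: str          # "text" or "image"
    tokens: List[int]  # token ids belonging to this segment


def curate_context(history: List[Segment], max_images: int = 2) -> List[Segment]:
    """Return history with all text kept and only the newest images kept.

    Assumption: recency is used as the selection rule purely for
    illustration. The budget is counted in image events, not tokens.
    """
    if max_images < 0:
        raise ValueError("max_images must be non-negative")

    # Positions of every image segment, in generation order.
    image_positions = [i for i, seg in enumerate(history) if seg.kind == "image"]

    # Guard max_images == 0 explicitly, because a [-0:] slice keeps everything.
    keep = set(image_positions[-max_images:]) if max_images > 0 else set()

    # Drop older image segments; preserve the original ordering of the rest.
    return [
        seg for i, seg in enumerate(history)
        if seg.kind != "image" or i in keep
    ]


if __name__ == "__main__":
    history = [
        Segment("text", [1, 2]),
        Segment("image", [10, 11, 12]),
        Segment("text", [3]),
        Segment("image", [20, 21, 22]),
        Segment("text", [4]),
        Segment("image", [30, 31, 32]),
    ]
    curated = curate_context(history, max_images=1)
    print([seg.kind for seg in curated])
```

In this sketch the conditioning context shrinks as older images are dropped, which is one plausible route to the smaller memory footprint listed in the findings; whether the paper relies on recency, attention statistics, or another criterion is not stated in this excerpt.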

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Aesthetic Perception and Analysis