Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs

Haochen Zhang; Animesh Sinha; Felix Juefei-Xu; Haoyu Ma; Kunpeng Li; Zhipeng Fan; Meng Dong; Xiaoliang Dai; Tingbo Hou; Peizhao Zhang; Zecheng He

arXiv:2601.20911 · cs.CV · January 30, 2026

Open Access

TL;DR

This paper introduces a framework for multi-round conversational image generation that conditions on long-range chat history rather than only the most recent image, so that outputs stay consistent and personalized across rounds.

Contribution

The paper proposes data construction, training, and inference methods for non-Markov multi-round interaction in multimodal large language models (MLLMs), including rollback-style editing and name-based personalization data, to improve long-term consistency and personalization.

Findings

01 Enhanced multi-round consistency and instruction compliance.

02 Maintained high-fidelity image reconstruction and personalization.

03 Effective handling of long-range history in conversational image generation.

Abstract

Conversational image generation requires a model to follow user instructions across multiple rounds of interaction, grounded in interleaved text and images that accumulate as chat history. While recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training recipes are effectively Markov: the next output depends primarily on the most recent image, enabling shortcut solutions that ignore long-range history. In this work we formalize and target the more challenging non-Markov setting, where a user may refer back to earlier states, undo changes, or reference entities introduced several rounds ago. We present (i) non-Markov multi-round data construction strategies, including rollback-style editing that forces retrieval of earlier visual states and name-based multi-round personalization that binds names to appearances across…
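
To make the Markov versus non-Markov distinction concrete, here is a minimal toy sketch of a rollback-style conversation. It is an illustration under stated assumptions, not the authors' data pipeline: images are replaced by integer state indices, and the names `Turn`, `build_rollback_conversation`, and `is_markov_solvable` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    instruction: str
    # Hypothetical stand-in for an image: index of the visual state the turn should produce.
    target_state: int


def build_rollback_conversation(edits: List[str], rollback_to: int) -> List[Turn]:
    """Chain edits, then append one turn that asks for an earlier state.

    State 0 is the initial image; applying edits[i] yields state i + 1.
    """
    if not 0 <= rollback_to < len(edits):
        raise ValueError("rollback_to must name a state before the latest one")
    turns = [Turn(instruction=e, target_state=i + 1) for i, e in enumerate(edits)]
    turns.append(
        Turn(
            instruction=f"Restore the image as it was at state {rollback_to}.",
            target_state=rollback_to,
        )
    )
    return turns


def is_markov_solvable(turns: List[Turn], t: int) -> bool:
    """In this toy model, a turn is Markov-solvable if its target is one edit past the previous state."""
    previous_state = turns[t - 1].target_state if t > 0 else 0
    return turns[t].target_state == previous_state + 1


conversation = build_rollback_conversation(
    ["add a red hat", "make it snow", "change the sky to sunset"],
    rollback_to=1,
)
```

In this toy setup the first three turns can be answered from the most recent state alone, while the final rollback turn targets state 1 and therefore needs the earlier history. That is the kind of shortcut-breaking example the abstract describes.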

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling