Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking
Shengqiong Wu, Bobo Li, Xinkai Wang, Xiangtai Li, Lei Cui, Furu Wei, Shuicheng Yan, Hao Fei, Tat-seng Chua

TL;DR
This paper introduces AD-Loop, a novel think paradigm that interleaves analyzing and drafting steps in vision-language models, significantly enhancing their ability to understand and generate multimodal content through iterative refinement.
Contribution
The paper proposes the interleaved Analyzing-Drafting (AD-Loop) mechanism and a two-stage training strategy, enabling models to dynamically alternate between comprehension and creation for improved synergy.
Findings
AD-Loop improves performance on standard benchmarks for both understanding and generation.
It enhances transferability across UVLM architectures.
Visual analysis confirms effective implicit visual thoughts.
Abstract
Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new think paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data…
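The alternation described in the abstract can be pictured with a toy control loop. This is only a minimal sketch of the interleaving idea, not the authors' implementation: the function name `ad_loop`, the three callbacks, the `max_rounds` cap, and the string-based stand-ins for textual and visual thoughts are all hypothetical. In the paper, the decision of when to analyze or draft is learned (supervised learning followed by reinforcement learning), whereas here it is a fixed rule.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (kind, content), where kind is "analyze" or "draft"


def ad_loop(
    query: str,
    analyze: Callable[[str], str],
    draft: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_rounds: int = 4,
) -> List[Step]:
    """Alternate analyze and draft steps until the analysis signals completion.

    Hypothetical illustration: strings stand in for textual thoughts
    (analysis) and visual thoughts (drafts).
    """
    steps: List[Step] = []
    state = query
    for _ in range(max_rounds):
        thought = analyze(state)          # textual thought about the current state
        steps.append(("analyze", thought))
        if is_done(thought):              # analysis is satisfied: stop refining
            break
        state = draft(thought)            # draft conditioned on the analysis
        steps.append(("draft", state))
    return steps


# Toy stand-ins (assumptions for illustration only).
def toy_analyze(state: str) -> str:
    # Declares the draft good enough once it has been refined twice.
    return "done" if state.count("+") >= 2 else f"refine:{state}"


def toy_draft(thought: str) -> str:
    # "Refines" the previous state by appending a marker.
    return thought.split(":", 1)[1] + "+"


def toy_is_done(thought: str) -> bool:
    return thought == "done"


if __name__ == "__main__":
    for kind, content in ad_loop("cat", toy_analyze, toy_draft, toy_is_done):
        print(kind, content)
```

The `max_rounds` cap also makes the cost concern raised in the reviews concrete: each extra round adds one analysis pass and one drafting pass per query.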
Peer Reviews
Decision · ICLR 2026 Poster
- The paper introduces a two-stage training strategy: (1) supervised learning on interleaved thought data to initialize the alternation, followed by (2) reinforcement learning to promote adaptive and autonomous control.
- The paper conducts extensive experiments and ablation studies to validate the effectiveness of the proposed method.
- AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, and further showed the adaptability of the proposed
- The definition of the inter-group reward in Equation (6) is unclear. What does $m$ represent? Does it indicate whether the trajectory is AD-Loop-enabled or not? Additionally, in Equation (7), the intra- and inter-reward terms on the right-hand side seem to be missing the superscript $m$?
- In the ablation study (Table 3), it is unclear whether the paper trained three additional variants corresponding to different thinking strategies and evaluated them separately, or whether they used the final
1. The interleaved analyzing–drafting mechanism establishes a tighter synergy between vision understanding and generation than prior “unified” models. This new paradigm addresses a clear gap by turning generation and analysis into mutually reinforcing steps rather than independent skills.
2. The two-stage training strategy (SFT of interleaved reasoning followed by RL) is well-motivated. This pipeline enables the model to learn the complex analyze-then-draft procedure in a guided way, then optim
1. The framework introduces significant complexity in both training and inference. It requires a specialized two-phase training (including an RL stage), and at runtime the model must perform multiple analyze–draft iterations per query. This likely incurs substantial computational cost and latency compared to standard one-pass models. The paper does not discuss inference speed or resource requirements, which raises practical concerns for real-world deployment.
2. The method relies on a curated i
The authors address an important issue with what seems to be an innovative approach. The comprehensive evaluation for understanding and generation is also a strength. Ablation of the thinking types section was also a strength. Good discussion addressing interesting questions spanning extensions into other MLLMs, whether visual thoughts should be derived from understanding vs. the generation encoder, and the visualization of implicit visual thoughts. Very thorough methods section with detail
I was unclear about how novel the work is. There are many publications using visual representations (imagination) to augment language reasoning. Please expand on how this is different. Please show some examples in Figure 7 where the proposed method did not generate better results. This would also be interesting. There were parts where I had a difficult time following the methods and even the evaluation. The paper is dense, but other reviewers more expert in the field might have an easier tim
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Language, Metaphor, and Cognition
