ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng

TL;DR
ThinkMorph is a unified multimodal reasoning model that generates interleaved text and image thoughts, improving performance on vision tasks and exhibiting emergent multimodal skills.
Contribution
It introduces a model fine-tuned on high-quality interleaved reasoning traces, demonstrating emergent multimodal capabilities and superior performance on vision-centric benchmarks.
Findings
Achieves a 34.7% average improvement over the base model on vision-centric benchmarks.
Generalizes well to out-of-domain tasks, surpassing larger VLMs.
Exhibits emergent skills like visual manipulation and adaptive reasoning modes.
Abstract
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary rather than isomorphic modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7 percent over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal…
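As an illustration only, an interleaved reasoning trace of the kind the abstract describes can be modeled as an ordered list of text and image thoughts. The sketch below is not ThinkMorph's data format or code: the names Thought, count_mode_switches, and reasoning_mode, and the example contents, are hypothetical and chosen only to make the idea of text-only, image-only, and interleaved reasoning modes concrete.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class Thought:
    """One reasoning step (hypothetical structure, not the paper's format)."""
    modality: Literal["text", "image"]
    content: str  # the text itself, or a placeholder/path for a generated image


def count_mode_switches(trace: List[Thought]) -> int:
    """Count how often consecutive steps change modality."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a.modality != b.modality)


def reasoning_mode(trace: List[Thought]) -> str:
    """Label a trace as text-only, image-only, or interleaved."""
    if not trace:
        raise ValueError("trace must contain at least one thought")
    modalities = {t.modality for t in trace}
    if modalities == {"text"}:
        return "text-only"
    if modalities == {"image"}:
        return "image-only"
    return "interleaved"


# Hypothetical example: a text step, a generated zoom-in, then a text conclusion.
example_trace = [
    Thought("text", "The object is small; zoom into the lower-left region."),
    Thought("image", "<generated zoom-in crop>"),
    Thought("text", "The cropped region shows the target object."),
]
```

Under these assumptions, example_trace is labeled as interleaved and contains two modality switches; a trace made only of text steps would be labeled text-only.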
Peer Reviews
Decision: ICLR 2026 Poster
- The problem is well-motivated, mixing reasoning across modalities and aiming to identify where it performs beyond unimodal reasoning and where it may not.
- I appreciate the qualitative analysis and notes on emergent properties.
- Results are evaluated on meaningful OOD benchmarks and provide some benefit in many settings.
- The exploration of the emergent properties is interesting and sheds light on the method, such as the test-time mode dynamics of when mode-switching occurs in Figure 2.
- Results seem specific to the tasks considered, but are presented as general conclusions about interleaved reasoning more broadly. I appreciate the out-of-distribution results, but they generally target similar tasks, like V-Star, which evaluates fine-grained visual search.
- The idea that the multimodal setting has a larger reasoning space is intuitively appealing, but it's unclear if this is actually the origin of the test-time advantage just from the presented results. For instance, th
1. ThinkMorph pioneers the "interleaved chain-of-thought" multimodal paradigm, enabling deep, dynamic, and complementary synergy between language and vision during reasoning. It moves beyond treating vision as passive input or an external tool, allowing the model to autonomously generate and modify visual content within its reasoning sequence, achieving intrinsic visual manipulation. This unique mechanism also fosters the emergence of novel visual operations and intelligent modali
1. The emergent visual manipulation capabilities, such as image inpainting and zoom-in, are presented as indicators of higher-level intelligence, yet their geometric and semantic consistency with the original image structure remains unvalidated. It is unclear whether the generated regions are aligned with the true physical structure or whether the model may produce visually plausible but physically incorrect details in the absence of explicit scale or spatial constraints.
2. Autonomous mode swi
- **Novel finding on interleaved reasoning**: A key contribution of the paper is the demonstration that training on interleaved text–image reasoning data enables the emergence of visual manipulation capabilities. This suggests that the model acquires a general-purpose mechanism for using visual representations as intermediate reasoning steps, rather than merely memorizing task-specific mappings.
- **Synthetic data curation methodology**: The paper’s effectiveness largely stems from its carefull
- **"Emergent Property 1" claim is overstated.** In my personal experience, the original Bagel model was already capable of zooming in/out and inpainting images when explicitly prompted—for example, given a prompt like "find the banana in the image," it could crop out the banana region from the input image. This capability appears to pre-exist in Bagel and is not necessarily a result of ThinkMorph’s training. Therefore, the claim of Emergent Property 1 lacks sufficient support, as the described
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Action Observation and Synchronization
