Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li

TL;DR
Uni-CoT introduces a unified, efficient framework for coherent multimodal reasoning across text and vision, achieving state-of-the-art results with reduced computational costs.
Contribution
The paper presents Uni-CoT, a novel two-level reasoning paradigm and structured training method enabling scalable, grounded multimodal reasoning within a single model.
Findings
Achieves SOTA performance on WISE, RISE, and KRIS benchmarks.
Demonstrates efficient reasoning with only 8 A100 GPUs.
Provides a unified approach for high-level planning and subtask execution.
Abstract
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity of modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given…
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper proposes a unified framework that integrates textual and visual Chain-of-Thought reasoning in one model. It introduces a hierarchical macro–micro design that improves reasoning structure, coherence, and interpretability.
2. The method employs an MDP-based self-reflective process that reduces complexity from quadratic to near-linear while enhancing robustness.
3. The method achieves strong results on multiple benchmarks for both image generation and understanding.
1. Despite reduced complexity, multimodal reasoning remains computationally expensive due to the large number of visual tokens per step.
2. The dataset used for training is relatively small and partially synthetic, which may limit generalization to real-world multimodal scenarios.
3. The paper could be strengthened by providing a quantitative comparison of token usage during inference against baseline models.
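The quadratic versus near-linear contrast raised in the review comments above can be illustrated with a rough count of attention pairs. This is only a sketch under stated assumptions, not the paper's own cost accounting: the function names, the fixed number of tokens per step, and the `window` parameter are all hypothetical and chosen for illustration.

```python
def full_history_cost(steps: int, tokens_per_step: int) -> int:
    """Attention pairs when step t attends to every token from steps 1..t.

    Assumption: each reasoning step emits the same number of tokens.
    """
    return sum(
        tokens_per_step * (t * tokens_per_step) for t in range(1, steps + 1)
    )


def markov_cost(steps: int, tokens_per_step: int, window: int = 2) -> int:
    """Attention pairs when step t attends only to the last `window` steps.

    Assumption: a Markov-style process needs only a bounded amount of recent
    context, here the current step plus (window - 1) preceding steps.
    """
    return sum(
        tokens_per_step * (min(t, window) * tokens_per_step)
        for t in range(1, steps + 1)
    )


if __name__ == "__main__":
    # Hypothetical setting: 1000 tokens per step (visual tokens dominate).
    for steps in (4, 8, 16, 32):
        full = full_history_cost(steps, 1000)
        bounded = markov_cost(steps, 1000)
        print(f"steps={steps:2d}  full={full:>14,}  bounded={bounded:>14,}")
```

Under these assumptions the full-history count grows with the square of the number of steps, while the bounded-context count grows by a fixed amount per additional step. Both counts still scale with the square of the tokens per step, which is consistent with the reviewer's point that many visual tokens per step keep the reasoning expensive.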
- The technical contribution, built on a unified model such as BAGEL, is effective.
- The proposed method significantly reduces computational complexity, making long-horizon multimodal reasoning more tractable.
- The paper includes both quantitative and qualitative results across multiple challenging benchmarks, including reasoning-driven generation and understanding tasks.
- Ablation on CoT mechanisms and training strategies provides strong evidence for the contribution of each component.
- The
- The evaluation on understanding tasks is insufficient. The authors report results only on the MME benchmark, which is not representative enough to support the claim of a unified model. More general and widely used benchmarks such as MMMU, MMBench, OCRBench, MathVista, and MathVision are missing.
- Experiments are conducted only on the BAGEL backbone. It is unclear how the framework would perform with other MLLMs or in a more model-agnostic setting.
1. Originality: This paper proposes a unified framework for multimodal chain-of-thought (CoT) reasoning, bridging visual and textual reasoning within a single coherent process. The Markov Decision Process formulation reduces the computational overhead.
2. Quality: The proposed method is well-motivated, combining goal decomposition with self-reflective decision-making. The experiments are comprehensive, covering both understanding and generation tasks, and the results consistently s
1. The main novelty is somewhat limited; there are similar works that use CoT for image generation refinement, such as LayerCraft [1]. Efficient attention is also a well-studied area. However, combining the two may be a first.
2. The training dataset for CoT image generation depends purely on synthetic data. This limits generalization to open-domain or real-world visual reasoning tasks.
3. Figure S3 in the appendix does not seem to accurately reflect a realistic ground-level scene, which raises ques
