Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models

Zixuan Ye; Quande Liu; Cong Wei; Yuanxing Zhang; Xintao Wang; Pengfei Wan; Kun Gai; Wenhan Luo

arXiv:2512.19686 · cs.CV · December 23, 2025

TL;DR

This paper introduces Visual-Aware CoT, a method that improves multi-modal generation in unified models by making the model explicitly maintain consistency with the visual reference images, through structured planning and iterative correction, which leads to higher visual fidelity.

Contribution

It proposes integrating visual context consistency into the reasoning of unified models through visual planning and iterative correction, trained with supervised fine-tuning and flow-GRPO for better multi-modal generation.

Findings

01. Outperforms zero-shot models in visual consistency
02. Improves multi-modal generation quality
03. Enhances visual feature preservation

Abstract

Recently, the introduction of Chain-of-Thought (CoT) has greatly improved the generation ability of unified models. However, the current thinking process during generation mainly focuses on text consistency with the text prompt and ignores visual context consistency with the visual reference images in multi-modal generation, e.g., multi-reference generation. The lack of such consistency leads to failures in maintaining key visual features (such as human ID, object attributes, and style). To this end, we integrate visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency through 1) Adaptive Visual Planning: generating a structured visual checklist that identifies the visual elements whose consistency must be preserved, and 2) Iterative Visual Correction: performing self-reflection with the…
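
The two steps named in the abstract can be pictured as a plan-then-correct loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function `visual_aware_generate`, its callable parameters, the `max_rounds` limit, and the toy stand-ins are all hypothetical, and because the abstract is cut off before it says what guides the self-reflection, the sketch simply assumes the checklist does. An "image" is modeled as the set of reference features it preserves.

```python
from typing import Callable, List, Set, Tuple

# Toy stand-ins (assumptions): an "image" is the set of features it preserves,
# and a checklist is a list of feature names that must stay consistent.
Image = Set[str]
Checklist = List[str]


def visual_aware_generate(
    prompt: str,
    references: List[str],
    plan: Callable[[str, List[str]], Checklist],
    generate: Callable[[str, List[str], Checklist], Image],
    is_consistent: Callable[[Image, str], bool],
    correct: Callable[[Image, List[str]], Image],
    max_rounds: int = 3,
) -> Tuple[Image, int]:
    """Plan a visual checklist, generate, then correct until the checklist passes."""
    # Step 1 (planning): decide which visual elements must be kept consistent.
    checklist = plan(prompt, references)
    image = generate(prompt, references, checklist)

    # Step 2 (correction): re-check the output and fix failed items, up to max_rounds.
    rounds = 0
    while rounds < max_rounds:
        failed = [item for item in checklist if not is_consistent(image, item)]
        if not failed:
            break
        image = correct(image, failed)
        rounds += 1
    return image, rounds


# Hypothetical toy components, used only to exercise the loop.
def toy_plan(prompt: str, references: List[str]) -> Checklist:
    return ["human ID", "object attribute", "style"]


def toy_generate(prompt: str, references: List[str], checklist: Checklist) -> Image:
    return {"style"}  # the first draft preserves only the style


def toy_is_consistent(image: Image, item: str) -> bool:
    return item in image


def toy_correct(image: Image, failed: List[str]) -> Image:
    return image | {failed[0]}  # repairs one failed item per round


final_image, rounds_used = visual_aware_generate(
    "two people in a watercolor scene",
    ["ref_a.png", "ref_b.png"],
    toy_plan,
    toy_generate,
    toy_is_consistent,
    toy_correct,
)
```

In this toy run the first draft keeps only the style, so the loop needs two correction rounds before every checklist item passes. In a real unified model each callable would be a model call rather than a set operation.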

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Data Visualization and Analytics