Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhan Luo

TL;DR
This paper introduces Visual-Aware CoT, a method that enhances multi-modal model generation by explicitly maintaining visual context consistency through structured planning and iterative correction, leading to improved visual fidelity.
Contribution
The paper integrates visual context consistency into the reasoning of unified models through visual planning and iterative visual correction, using supervised fine-tuning and flow-GRPO to improve multi-modal generation.
Findings
Outperforms zero-shot models in visual consistency
Improves multi-modal generation quality
Enhances visual feature preservation
Abstract
Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, the current thinking process during generation mainly focuses on textual consistency with the text prompt and ignores visual context consistency with the visual reference images during multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in a failure to maintain key visual features (such as human ID, object attributes, and style). To this end, we integrate visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency by 1) Adaptive Visual Planning: generating a structured visual checklist that identifies the visual elements whose consistency must be kept, and 2) Iterative Visual Correction: performing self-reflection with the…
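The two steps named in the abstract can be pictured as a plan-then-correct loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract is truncated before it describes how correction works, so the names `CheckItem`, `plan_checklist`, `generate_with_correction`, the `generate` and `check` callables, and the `max_rounds` limit are all hypothetical. In particular, the checklist here is passed in by the caller, whereas the paper has the model produce it adaptively.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class CheckItem:
    """One visual element whose consistency should be kept (hypothetical)."""
    element: str            # e.g. "human ID", "object attribute", "style"
    satisfied: bool = False


def plan_checklist(elements: List[str]) -> List[CheckItem]:
    # Stand-in for Adaptive Visual Planning: here the caller supplies the
    # elements; in the paper the model is said to generate the checklist.
    return [CheckItem(e) for e in elements]


def generate_with_correction(
    generate: Callable[[List[str]], Any],
    check: Callable[[Any, str], bool],
    elements: List[str],
    max_rounds: int = 3,
) -> Tuple[Any, List[CheckItem]]:
    """Generate, check every checklist item, and regenerate with feedback."""
    checklist = plan_checklist(elements)
    failing: List[str] = []
    output = None
    for _ in range(max_rounds):
        # Regenerate, passing the elements that failed in the previous round.
        output = generate(failing)
        for item in checklist:
            item.satisfied = check(output, item.element)
        failing = [item.element for item in checklist if not item.satisfied]
        if not failing:
            break  # every checklist item is satisfied
    return output, checklist


# Toy stand-ins so the loop can run without any model (assumptions only).
def make_toy_generator(initial: List[str]) -> Callable[[List[str]], set]:
    preserved = set(initial)

    def generate(failing: List[str]) -> set:
        preserved.update(failing)   # "fix" whatever was reported as failing
        return set(preserved)

    return generate


def toy_check(output: set, element: str) -> bool:
    return element in output
```

With the toy generator starting from `["style"]` and a checklist of human ID, object attribute, and style, the first round fails two items and the second round satisfies all three. A real system would replace `generate` and `check` with calls into the unified model.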
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Data Visualization and Analytics
