Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
Yibo Jiang, Tao Wu, Rui Jiang, Yehao Lu, Chaoxiang Cai, Zequn Qin, Xi Li

TL;DR
This paper introduces UniRect-CoT, a training-free framework that leverages the inherent understanding of unified multimodal models to improve their generation quality through reflective rectification inspired by human reasoning.
Contribution
It proposes a novel, training-free rectification method that activates internal knowledge during generation, significantly enhancing UMMs' performance on complex tasks.
Findings
UniRect-CoT improves generation quality across diverse tasks.
The method aligns intermediate results with target instructions for self-supervision.
It can be integrated into existing UMMs without additional training.
Abstract
Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human "Thinking-While-Drawing" paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the "free lunch" hidden in the UMM's powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during…
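To make the reflect-and-rectify idea above concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the abstract is truncated and does not specify how UniRect-CoT scores or corrects intermediate results, so every name below (`reflective_rectification`, `drift_step`, `alignment`, `nudge`, `TARGET`) is a hypothetical stand-in. The "intermediate result" is just a list of numbers, the "understanding" signal is a distance to a target, and the loop only illustrates the general pattern of generating a step, reflecting on it, and keeping a correction when it improves alignment, with no training involved.

```python
from typing import Callable, List

State = List[float]

# Hypothetical stand-in for the "target instruction" the output should match.
TARGET: State = [1.0, 2.0]


def drift_step(state: State) -> State:
    """Toy generation step: moves every coordinate by a fixed amount."""
    return [v + 0.1 for v in state]


def alignment(state: State) -> float:
    """Toy understanding signal: higher means closer to TARGET."""
    return -sum((v - t) ** 2 for v, t in zip(state, TARGET))


def nudge(state: State) -> State:
    """Toy rectification: move halfway toward TARGET."""
    return [v + 0.5 * (t - v) for v, t in zip(state, TARGET)]


def reflective_rectification(
    state: State,
    step: Callable[[State], State],
    score: Callable[[State], float],
    rectify: Callable[[State], State],
    num_steps: int,
) -> State:
    """Generate, reflect, rectify; no parameters are trained or updated."""
    for _ in range(num_steps):
        state = step(state)            # produce an intermediate result
        candidate = rectify(state)     # propose a correction
        if score(candidate) > score(state):
            state = candidate          # keep it only if alignment improves
    return state


if __name__ == "__main__":
    rectified = reflective_rectification([0.0, 0.0], drift_step, alignment, nudge, 5)
    print(rectified, alignment(rectified))
```

In this toy setting the rectified trajectory ends closer to the target than the same number of plain generation steps would. In a real UMM, the scoring and correction roles would presumably be played by the model's own understanding capability rather than a hand-written distance.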
