Large Language Models are Universal Reasoners for Visual Generation
Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu, Alan Yuille, Liang-Chieh Chen, Jiasen Lu

TL;DR
This paper introduces UniReasoner, a framework that uses large language models as universal reasoners to improve the alignment and faithfulness of text-to-image generation by generating visual drafts and grounded evaluations.
Contribution
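The paper formalizes the understanding-generation gap (unified models verify whether an image satisfies a prompt far more reliably than they generate images that satisfy it) and proposes a method in which the LLM guides a diffusion model through a visual draft and a self-critique of that draft, improving prompt adherence.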
Findings
Improves compositional alignment and semantic faithfulness in generated images.
Maintains image quality while enhancing prompt adherence.
Leverages LLM reasoning to close the understanding-generation gap.
Abstract
Text-to-image generation has advanced rapidly with diffusion models, progressing from CLIP and T5 conditioning to unified systems where a single LLM backbone handles both visual understanding and generation. Despite the architectural unification, these systems frequently fail to faithfully align complex prompts during synthesis, even though they remain highly accurate at verifying whether an image satisfies those same prompts. We formalize this as the understanding-generation gap and propose UniReasoner, a framework that leverages the LLM as a universal reasoner to convert its understanding strength into direct generation guidance. Given a prompt, the LLM first produces a coarse visual draft composed of discrete vision tokens. It then performs a self-critique by evaluating the draft for prompt consistency, producing a grounded textual evaluation that pinpoints what needs to be…
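The draft-then-critique flow described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the names `Guidance`, `build_guidance`, `draft_fn`, `critique_fn`, and the toy stand-ins are all hypothetical, and the way the resulting guidance conditions the diffusion model is not specified in this summary, so it is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Guidance:
    """Bundle of signals the LLM hands to the image generator (hypothetical)."""
    prompt: str
    draft_tokens: List[int]  # coarse visual draft as discrete vision tokens
    critique: str            # grounded textual evaluation of the draft


def build_guidance(
    prompt: str,
    draft_fn: Callable[[str], List[int]],
    critique_fn: Callable[[str, List[int]], str],
) -> Guidance:
    """Run the two LLM steps in order: draft first, then self-critique.

    draft_fn and critique_fn are assumed interfaces standing in for calls
    to the same LLM backbone; they are not part of any real library.
    """
    draft = draft_fn(prompt)               # step 1: coarse visual draft
    critique = critique_fn(prompt, draft)  # step 2: check draft against prompt
    return Guidance(prompt=prompt, draft_tokens=draft, critique=critique)


# Toy stand-ins so the sketch runs without a model (assumptions, not real APIs).
def toy_draft(prompt: str) -> List[int]:
    # Pretend each word maps to one "vision token" (its length).
    return [len(word) for word in prompt.split()]


def toy_critique(prompt: str, draft: List[int]) -> str:
    # Pretend evaluation: report how many tokens the draft contains.
    return f"draft has {len(draft)} tokens for prompt: {prompt}"


guidance = build_guidance("a red cube on a blue sphere", toy_draft, toy_critique)
```

In a real system both callables would presumably be served by the same LLM, and the `Guidance` object (prompt, draft, and critique together) would be passed to the diffusion model as conditioning.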
