LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation
Xiaoqi Dong, Xiangyu Zhou, Nicholas Evans, Yujia Lin

TL;DR
LumiGen introduces an LVLM-enhanced iterative framework that significantly improves fine-grained text-to-image generation by incorporating feedback mechanisms for better content control and semantic accuracy.
Contribution
The paper presents a novel LVLM-driven iterative framework with feedback loops and prompt augmentation to enhance T2I models' fine-grained control and semantic consistency.
Findings
Achieves an average score of 3.08 on the LongBench-T2I benchmark, which the paper reports as higher than the compared baselines.
Significantly improves text rendering accuracy.
Enhances pose expression and compositional coherence.
Abstract
Text-to-Image (T2I) generation has made significant advancements with diffusion models, yet challenges persist in handling complex instructions, ensuring fine-grained content control, and maintaining deep semantic consistency. Existing T2I models often struggle with tasks like accurate text rendering, precise pose generation, or intricate compositional coherence. Concurrently, Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. We propose LumiGen, a novel LVLM-enhanced iterative framework designed to elevate T2I model performance, particularly in areas requiring fine-grained control, through a closed-loop, LVLM-driven feedback mechanism. LumiGen comprises an Intelligent Prompt Parsing & Augmentation (IPPA) module for proactive prompt enhancement and an Iterative Visual Feedback & Refinement (IVFR) module, which…
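As a rough illustration of the closed-loop idea described in the abstract (augment the prompt up front, generate, have a critic inspect the result, then refine and retry), the sketch below wires three placeholder callables into a loop. It is not the authors' implementation: the names `iterative_generate`, `Feedback`, `max_rounds`, and `threshold`, the scoring scheme, and the toy stand-ins for the T2I model and the LVLM critic are all assumptions made here for illustration, and the abstract is truncated before it specifies how IVFR actually works.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    """Hypothetical critic output: a score in [0, 1] plus textual fixes."""
    score: float
    suggestions: List[str] = field(default_factory=list)


def iterative_generate(
    prompt: str,
    augment: Callable[[str], str],             # stand-in for an IPPA-like step
    generate: Callable[[str], str],            # stand-in for a T2I model
    critique: Callable[[str, str], Feedback],  # stand-in for an LVLM critic
    max_rounds: int = 3,                       # assumed iteration budget
    threshold: float = 0.9,                    # assumed acceptance score
) -> Tuple[str, List[float]]:
    """Generate, critique, and refine until the critic is satisfied."""
    current = augment(prompt)
    scores: List[float] = []
    image = ""
    for _ in range(max_rounds):
        image = generate(current)
        fb = critique(prompt, image)
        scores.append(fb.score)
        if fb.score >= threshold:
            break
        # Fold the critic's suggestions back into the working prompt.
        current = current + " " + " ".join(fb.suggestions)
    return image, scores


# Toy stand-ins so the loop can be run without any model.
def toy_augment(p: str) -> str:
    return p + ", highly detailed"


def toy_generate(p: str) -> str:
    return f"<image of: {p}>"


def toy_critique(original: str, image: str) -> Feedback:
    if "legible text" in image:
        return Feedback(1.0)
    return Feedback(0.5, ["legible text"])


image, scores = iterative_generate(
    "a shop sign", toy_augment, toy_generate, toy_critique
)
print(scores)
```

With these toy stand-ins the first round is rejected and the second is accepted once the suggestion has been merged into the prompt. In a real system the three callables would wrap an actual diffusion model and an LVLM, and the refinement step could be far richer than string concatenation.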
