LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation

Xiaoqi Dong; Xiangyu Zhou; Nicholas Evans; Yujia Lin

arXiv:2508.04732·cs.LG·August 8, 2025


TL;DR

LumiGen introduces an LVLM-enhanced iterative framework that significantly improves fine-grained text-to-image generation by incorporating feedback mechanisms for better content control and semantic accuracy.

Contribution

The paper presents a novel LVLM-driven iterative framework with feedback loops and prompt augmentation to enhance T2I models' fine-grained control and semantic consistency.

Findings

1. Achieves a higher average score of 3.08 on the LongBench-T2I benchmark.
2. Significantly improves text rendering accuracy.
3. Enhances pose expression and compositional coherence.

Abstract

Text-to-Image (T2I) generation has made significant advancements with diffusion models, yet challenges persist in handling complex instructions, ensuring fine-grained content control, and maintaining deep semantic consistency. Existing T2I models often struggle with tasks like accurate text rendering, precise pose generation, or intricate compositional coherence. Concurrently, Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. We propose LumiGen, a novel LVLM-enhanced iterative framework designed to elevate T2I model performance, particularly in areas requiring fine-grained control, through a closed-loop, LVLM-driven feedback mechanism. LumiGen comprises an Intelligent Prompt Parsing & Augmentation (IPPA) module for proactive prompt enhancement and an Iterative Visual Feedback & Refinement (IVFR) module, which…
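The closed-loop idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not specify any interface, so `refine_loop`, `Feedback`, the three callables, `max_rounds`, and `target` are all hypothetical names chosen for illustration. The sketch only shows the general pattern of augmenting a prompt once, then alternating generation and critique until a critic score clears a threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    score: float      # hypothetical critic rating of the image against the prompt
    suggestion: str   # hypothetical text hint to fold into the next prompt


def refine_loop(
    prompt: str,
    augment: Callable[[str], str],              # stand-in for a prompt-augmentation step
    generate: Callable[[str], str],             # stand-in for a T2I model call
    critique: Callable[[str, str], Feedback],   # stand-in for an LVLM critic
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, float, List[str]]:
    """Generic generate-critique-refine loop (an assumption, not the paper's API)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    current = augment(prompt)          # proactive enhancement happens once, up front
    history: List[str] = []
    best_image, best_score = "", float("-inf")

    for _ in range(max_rounds):
        history.append(current)
        image = generate(current)
        feedback = critique(prompt, image)   # always judged against the original prompt
        if feedback.score > best_score:
            best_image, best_score = image, feedback.score
        if feedback.score >= target:
            break
        # Fold the critic's suggestion into the prompt for the next round.
        current = f"{current}. {feedback.suggestion}"

    return best_image, best_score, history


# Toy stand-ins so the loop can be run without any model.
def toy_augment(p: str) -> str:
    return p + ", detailed"


def toy_generate(p: str) -> str:
    return f"<image for: {p}>"


def toy_critique(prompt: str, image: str) -> Feedback:
    ok = "legible sign text" in image
    return Feedback(1.0 if ok else 0.4, "legible sign text")
```

With the toy stand-ins, `refine_loop("a cafe with a sign", toy_augment, toy_generate, toy_critique)` runs two rounds: the first image falls below the threshold, the critic's suggestion is appended to the prompt, and the second image passes. Keeping the best-scoring image rather than the last one is a design choice of this sketch, so a late regression does not discard an earlier good result.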
