TL;DR
This paper introduces FiMR, a framework that enhances text-to-image generation by decomposing prompts into semantic units, verifying and refining each via multimodal reasoning for more precise image alignment.
Contribution
FiMR uses fine-grained multimodal reasoning with visual question answering (VQA) to improve prompt understanding and generation quality, especially for complex, compositional prompts.
Findings
FiMR outperforms existing reasoning-based image generation methods.
It achieves more accurate image-prompt alignment on compositional benchmarks.
Extensive experiments validate the effectiveness of FiMR in fine-grained control.
Abstract
With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. Therefore, we propose Fine-grained Multimodal Reasoning (FiMR), a framework that leverages decomposed visual question answering (VQA) to break down an input prompt into minimal semantic units, such as entities and attributes, and verify each unit via VQA to generate explicit, fine-grained feedback. Based…
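The per-unit verification idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `SemanticUnit` class, the `verify_units` and `build_feedback` helpers, the hand-written units, and the dictionary standing in for a VQA model are all hypothetical names and data introduced here for illustration. In the paper, decomposition and question answering are presumably handled by the MLLM itself; here both are replaced by fixed toy inputs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SemanticUnit:
    """One minimal unit of a prompt (hypothetical structure, not from the paper)."""
    kind: str      # e.g. "entity" or "attribute"
    question: str  # yes/no question that checks this unit against an image


def verify_units(
    units: List[SemanticUnit],
    answer_fn: Callable[[str], bool],
) -> List[SemanticUnit]:
    """Return the units whose question the (stand-in) VQA function answers 'no'."""
    return [u for u in units if not answer_fn(u.question)]


def build_feedback(failed: List[SemanticUnit]) -> str:
    """Turn the failed units into explicit, per-unit textual feedback."""
    if not failed:
        return "All units verified."
    return "\n".join(f"- [{u.kind}] failed check: {u.question}" for u in failed)


# Hand-written decomposition of the prompt "a red dog next to a ball".
# A real system would produce these units with a model (assumption).
units = [
    SemanticUnit("entity", "Is there a dog?"),
    SemanticUnit("attribute", "Is the dog red?"),
    SemanticUnit("entity", "Is there a ball?"),
]

# Toy stand-in for a VQA model inspecting a generated image (assumption):
# the imagined image contains a dog and a ball, but the dog is not red.
toy_answers = {
    "Is there a dog?": True,
    "Is the dog red?": False,
    "Is there a ball?": True,
}

failed = verify_units(units, lambda q: toy_answers[q])
feedback = build_feedback(failed)
```

Compared with a single holistic "does the image match the prompt?" judgment, this per-unit structure pinpoints which entity or attribute failed, which is the kind of explicit feedback a refinement step could act on.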
