Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning

Yongjin Kim; Yoonjin Oh; Yerin Kim; Hyomin Kim; Jeeyoung Yun; Yujung Heo; Minjun Kim; Sungwoong Kim

arXiv:2604.13491·cs.CV·April 17, 2026

TL;DR

This paper introduces FiMR, a framework that improves text-to-image generation by decomposing a prompt into minimal semantic units, then verifying and refining each unit through multimodal reasoning for more precise image-prompt alignment.

Contribution

FiMR uses fine-grained multimodal reasoning with decomposed VQA to improve prompt understanding and image generation quality, especially for complex, compositional prompts.

Findings

01. FiMR outperforms existing reasoning-based image generation methods.

02. It achieves more accurate image-prompt alignment on compositional benchmarks.

03. Extensive experiments validate the effectiveness of FiMR in fine-grained control.

Abstract

With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. Therefore, we propose Fine-grained Multimodal Reasoning (FiMR), a framework that leverages decomposed visual question answering (VQA) to break down an input prompt into minimal semantic units, such as entities and attributes, and verify each unit via VQA to generate explicit, fine-grained feedback. Based…
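The decompose-and-verify step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper or its repository: the `SemanticUnit` type, the `verify_units` function, the hand-written unit list, and the toy VQA stand-in (a real system would query an MLLM on an actual image). The refinement step that consumes the feedback is not sketched.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class SemanticUnit:
    """Hypothetical container for one minimal semantic unit of a prompt."""
    kind: str       # e.g. "entity" or "attribute"
    text: str       # the piece of the prompt this unit covers
    question: str   # yes/no question used to verify the unit


def verify_units(image, units: List[SemanticUnit],
                 vqa: Callable[[object, str], bool]) -> List[str]:
    """Ask one VQA question per unit and collect feedback for failed units."""
    feedback = []
    for unit in units:
        if not vqa(image, unit.question):
            feedback.append(f"{unit.kind} not satisfied: {unit.text}")
    return feedback


def toy_vqa(image: Set[str], question: str) -> bool:
    """Toy stand-in for a VQA model: the 'image' is the set of questions
    whose answer is yes. This is an assumption for the demo only."""
    return question in image


# Hand-written decomposition of the prompt "a black cat on a sofa".
units = [
    SemanticUnit("entity", "a cat", "Is there a cat?"),
    SemanticUnit("attribute", "the cat is black", "Is the cat black?"),
    SemanticUnit("entity", "a sofa", "Is there a sofa?"),
]

# A generated "image" that contains a cat and a sofa, but the cat is not black.
toy_image = {"Is there a cat?", "Is there a sofa?"}

feedback = verify_units(toy_image, units, toy_vqa)
print(feedback)
```

In this toy run only the color attribute fails, so the feedback names that single unit instead of returning one holistic alignment judgment for the whole image.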

Code & Models

Repositories

KU-AGI/FiMR (GitHub)