TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering
Dongxing Mao, Yilin Wang, Linjie Li, Zhengyuan Yang, Alex Jinpeng Wang

TL;DR
This paper introduces TextGround4M, a dataset of over 4 million prompt-image pairs annotated with span-level text and bounding boxes, and proposes a training strategy and metrics to improve spatially accurate, prompt-grounded text rendering in image generation models.
Contribution
The paper provides a large-scale dataset, a lightweight training strategy for autoregressive text-to-image (T2I) models that appends layout-aware span tokens during training, and new evaluation metrics for layout-aware text rendering.
Findings
Models trained on TextGround4M outperform baselines in text fidelity.
The approach improves spatial accuracy and prompt consistency.
New metrics effectively evaluate spatial layout quality.
Abstract
Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with correct spatial layout -- especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference…
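To make the annotation structure described in the abstract concrete, here is a minimal sketch of what one sample and one training sequence might look like. Everything specific in it is an assumption for illustration, not the paper's format: the `TextSpan` and `Sample` classes, the `(x0, y0, x1, y1)` pixel box convention, the 1000-bin coordinate quantization, and the `<span>`/`<box>` token strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextSpan:
    text: str                       # text expected to appear in the image
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; assumed convention


@dataclass
class Sample:
    prompt: str
    spans: List[TextSpan]


def spans_grounded(sample: Sample) -> bool:
    """Return True if every span's text occurs verbatim in the prompt."""
    return all(span.text in sample.prompt for span in sample.spans)


def quantize(value: int, size: int, bins: int = 1000) -> int:
    """Map a pixel coordinate to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, value * bins // size))


def training_sequence(sample: Sample, width: int, height: int) -> str:
    """Append one hypothetical layout token group per span to the prompt."""
    parts = [sample.prompt]
    for span in sample.spans:
        x0, y0, x1, y1 = span.box
        coords = (
            quantize(x0, width),
            quantize(y0, height),
            quantize(x1, width),
            quantize(y1, height),
        )
        coord_text = ",".join(str(c) for c in coords)
        parts.append(f"<span>{span.text}<box>{coord_text}</span>")
    return " ".join(parts)
```

In this sketch the span tokens are only appended when building training sequences, so the prompt itself is unchanged at inference time, which mirrors the abstract's statement that the strategy does not alter inference. How the actual method tokenizes spans and boxes is not stated in the text above.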
