TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

Mingxuan Cui; Jingpu Yang; Fengxian Ji; Qian Jiang; Zhecheng Shi; Jiaming Wang; Zirui Song; Fajri Koto; Xiuying Chen

arXiv:2605.19320 · cs.CV · May 20, 2026

TL;DR

TextAlign introduces a hierarchical reward-based framework to improve text rendering accuracy in image generation models without altering their architecture.

Contribution

The paper proposes a scalable, non-invasive preference-alignment method that uses hierarchical rewards to improve text rendering in large text-to-image models.

Findings

01. Consistent improvements in OCR-based text accuracy on benchmark datasets.

02. Outperforms existing baselines such as SD3.5, Qwen-Image, and TextDiffuser.

03. Improves text rendering without degrading overall image quality.
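
To make "OCR-based text accuracy" concrete, here is a minimal sketch of two common ways such a score can be computed once an OCR system has read the text back from a generated image. This page does not say which metric or OCR engine the paper uses, so the function names, the lowercasing and whitespace normalization, and both metrics are illustrative assumptions rather than the paper's evaluation protocol.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def char_accuracy(target: str, ocr_output: str) -> float:
    """1 minus normalized edit distance between target and OCR text."""
    t, o = normalize(target), normalize(ocr_output)
    if not t and not o:
        return 1.0
    return 1.0 - levenshtein(t, o) / max(len(t), len(o))


def word_accuracy(target: str, ocr_output: str) -> float:
    """Fraction of target words that OCR recovered exactly (multiset match)."""
    target_words = normalize(target).split()
    if not target_words:
        return 1.0
    available = Counter(normalize(ocr_output).split())
    hits = 0
    for word in target_words:
        if available[word] > 0:
            hits += 1
            available[word] -= 1
    return hits / len(target_words)
```

For a prompt asking for "grand opening sale" and an OCR reading of "GRAND OPENNG SALE", `word_accuracy` counts two of three words as correct, while `char_accuracy` penalizes only the single missing character.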

Abstract

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo…
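
The reward described in the abstract can be sketched roughly as follows: binary defect judgments at the global, word, and glyph levels are collapsed into one scalar, which is then used either as a group-relative advantage (GRPO-style) or to pick a chosen/rejected pair (DPO-style). The abstract does not give the aggregation rule, so the level weights, the pass-rate scoring, and all names below (`DefectJudgments`, `hierarchical_reward`, and so on) are hypothetical. The VLM judge itself is not modeled; its outputs are taken as given booleans.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# Hypothetical level weights; the abstract does not state the actual values.
LEVEL_WEIGHTS: Dict[str, float] = {"global": 0.2, "word": 0.3, "glyph": 0.5}


@dataclass
class DefectJudgments:
    """Binary defect flags per level (True = defect found), e.g. from a VLM judge."""
    global_flags: List[bool]
    word_flags: List[bool]
    glyph_flags: List[bool]


def level_score(flags: List[bool]) -> float:
    """Fraction of checks passed at one level; 1.0 if no checks were run."""
    if not flags:
        return 1.0
    return 1.0 - sum(flags) / len(flags)


def hierarchical_reward(j: DefectJudgments,
                        weights: Dict[str, float] = LEVEL_WEIGHTS) -> float:
    """Weighted average of per-level pass rates, in [0, 1] (assumed rule)."""
    scores = {
        "global": level_score(j.global_flags),
        "word": level_score(j.word_flags),
        "glyph": level_score(j.glyph_flags),
    }
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in scores) / total


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """GRPO-style: standardize rewards within a group of samples for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def best_worst_pair(rewards: List[float]) -> Tuple[int, int]:
    """DPO-style: indices of the (chosen, rejected) samples by reward."""
    chosen = max(range(len(rewards)), key=rewards.__getitem__)
    rejected = min(range(len(rewards)), key=rewards.__getitem__)
    return chosen, rejected
```

The same scalar serves both optimizers in this sketch: GRPO consumes the standardized advantages directly, while DPO only needs the ordering of samples to form preference pairs.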
