TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

Yu Xie, Jielei Zhang, Pengyu Chen, Weihang Wang, Longwen Gao, Peiyi Li, Qian Qiao, and Zhouhui Lian

arXiv:2505.17778 · cs.CV · March 13, 2026

TL;DR

TextFlux is a novel OCR-free diffusion-based model for high-fidelity, multilingual scene text synthesis that requires less data and offers flexible, controllable multi-line text generation.

Contribution

The paper introduces TextFlux, a DiT-based framework that eliminates OCR modules, improves multilingual scalability, reduces training-data requirements, and enables precise multi-line text control.

Findings

01. Outperforms previous methods in qualitative evaluations.

02. Effective in low-resource multilingual settings.

03. Requires only 1% of the training data used by competing methods.

Abstract

Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models' inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in…

Taxonomy

Methods: Diffusion