TL;DR
TextFlux is a novel OCR-free diffusion-based model for high-fidelity, multilingual scene text synthesis that requires less data and offers flexible, controllable multi-line text generation.
Contribution
The paper introduces TextFlux, a DiT-based framework that eliminates OCR modules, improves multilingual scalability, reduces training data needs, and enables precise multi-line text control.
Findings
Outperforms previous methods in qualitative evaluations.
Effective in low-resource multilingual settings.
Requires only 1% of the training data used by competing methods.
Abstract
Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models' inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in…
Taxonomy
Methods: Diffusion
