EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering
Runnan Lu, Yuxuan Zhang, Jiaming Liu, Haofan Wang, Yiren Song

TL;DR
EasyText introduces a diffusion transformer-based framework for controllable, high-quality multilingual text rendering, leveraging large-scale synthetic datasets and novel encoding techniques to improve accuracy and layout control.
Contribution
The paper presents EasyText, a diffusion transformer framework that uses character positioning encoding and position encoding interpolation for precise, controllable multilingual text rendering.
Findings
Effective multilingual text rendering demonstrated
High visual quality and layout control achieved
Large-scale synthetic datasets enhance training
Abstract
Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still an unexplored area. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations as well as a high-quality dataset of 20K annotated images, which are used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness and…
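The abstract names position encoding interpolation but does not describe it. As a rough illustration only, the sketch below shows one generic way such an interpolation could work: the cells of a small condition grid are assigned fractional (row, col) positions spread linearly across a target box in the latent grid. The function name `interpolate_positions`, the box format, and the linear mapping are all assumptions made for this example and are not taken from the paper.

```python
def interpolate_positions(cond_h, cond_w, box):
    """Map each cell of a cond_h x cond_w condition grid to fractional
    (row, col) coordinates inside a target box.

    box is (top, left, bottom, right) in latent-grid units.
    Hypothetical helper: illustrates linear interpolation of position
    indices, not the method described in the paper.
    """
    top, left, bottom, right = box

    def lerp(start, end, i, n):
        # With a single cell there is nothing to spread out,
        # so place it at the midpoint of the interval.
        if n == 1:
            return (start + end) / 2.0
        return start + (end - start) * i / (n - 1)

    return [
        [(lerp(top, bottom, r, cond_h), lerp(left, right, c, cond_w))
         for c in range(cond_w)]
        for r in range(cond_h)
    ]


# A 2 x 3 condition grid stretched over rows 10..14 and columns 20..28.
grid = interpolate_positions(2, 3, (10, 20, 14, 28))
```

Under this assumed scheme, the corner cells of the condition grid land exactly on the corners of the target box, and interior cells are spaced evenly between them, so moving or resizing the box changes where the conditioned content is placed.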
Taxonomy
Topics: Human Motion and Animation · Computer Graphics and Visualization Techniques · Video Analysis and Summarization
Methods: Diffusion
