FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

Rui Lan; Yancheng Bai; Xu Duan; Mingxing Li; Dongyang Jin; Ryan Xu; Dong Nie; Lei Sun; Xiangxiang Chu

arXiv:2505.03329 · cs.CV · November 21, 2025

TL;DR

FLUX-Text introduces a lightweight, multilingual diffusion transformer for scene text editing that significantly improves glyph understanding and reduces training data requirements while maintaining high visual quality.

Contribution

The paper presents FLUX-Text, a novel diffusion transformer model with lightweight modules and a regional perceptual loss, enabling effective multilingual scene text editing with minimal training data.

Findings

01. Outperforms existing methods in visual quality and text fidelity.

02. Requires only 0.1M training examples, a 97% reduction from previous methods.

03. Effective on English and Chinese benchmarks.

Abstract

Scene text editing aims to modify or add texts on images while ensuring text fidelity and overall visual quality consistent with the background. Recent methods are primarily built on UNet-based diffusion models, which have improved scene text editing results, but still struggle with complex glyph structures, especially for non-Latin ones (e.g., Chinese, Korean, Japanese). To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing DiT method. Specifically, our FLUX-Text enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules, while preserving the original generative capability of FLUX. We further propose a Regional Text Perceptual Loss tailored for text regions, along with a matching two-stage training strategy to better balance text editing and overall image quality. Benefiting from the DiT-based…
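
The abstract does not give the form of the Regional Text Perceptual Loss. Purely as an illustration of what restricting a loss to text regions can mean, the sketch below averages a squared error over only those pixels that fall inside a binary text mask. The function name, the mask format, and the use of plain squared error in place of a perceptual (feature-space) distance are all assumptions made for this example, not the authors' formulation.

```python
def regional_masked_mse(pred, target, text_mask):
    """Mean squared error restricted to a text region (illustrative only).

    pred, target: 2D lists (H x W) of floats, hypothetical image or feature maps.
    text_mask:    2D list (H x W), 1 inside the text region and 0 elsewhere.

    Assumption: plain squared error stands in for a perceptual distance;
    the paper's actual loss is not specified in the abstract.
    """
    total = 0.0
    count = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, text_mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            total += m * (p - t) ** 2  # errors outside the mask contribute nothing
            count += m
    # Normalise by the masked area so the value does not scale with region size.
    return total / count if count > 0 else 0.0


# Usage: only the top-left pixel is inside the text region,
# so the large errors elsewhere are ignored.
pred = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 5.0], [5.0, 5.0]]
mask = [[1, 0], [0, 0]]
loss = regional_masked_mse(pred, target, mask)
```

Normalising by the masked area rather than the full image is one possible design choice: it keeps small text boxes from being drowned out by the background, which is the general motivation the abstract gives for a region-specific loss.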

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Power Systems and Technologies

Methods: Diffusion