Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation

Gang Dai; Yifan Zhang; Yutao Qin; Qiangya Guo; Shuangping Huang; Shuicheng Yan

arXiv:2508.03256·cs.CV·August 6, 2025

TL;DR

This paper introduces DiffBrush, a diffusion-based model that generates realistic handwritten text lines by capturing complex style patterns and ensuring content accuracy, advancing beyond isolated word generation methods.

Contribution

DiffBrush is a novel diffusion model that disentangles style and content, enabling high-quality, coherent handwritten text line generation with improved style imitation and content preservation.

Findings

01. Outperforms existing methods in style reproduction.
02. Maintains high content accuracy across generated lines.
03. Produces realistic, coherent handwritten text lines.

Abstract

Existing handwritten text generation methods primarily focus on isolated words. However, realistic handwritten text demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Therefore, generating entire text lines emerges as a more promising and comprehensive task. However, this task poses significant challenges, including the accurate modeling of complex style patterns encompassing both intra- and inter-word relationships, and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush excels in both style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 8·Confidence 4

Strengths

- The model extends beyond the generation of isolated words to the generation of full text lines, which is crucial for real-world applications such as synthetic data generation.
- Extensive testing is performed on two different datasets in English and German.
- The evaluation is robust, using three different sets of metrics that assess feature-based, OCR and visual quality aspects.
- Competing models are retrained to ensure a fair and direct comparison.
- The proposed method demonstrates signifi

Weaknesses

- Little information is provided about the OCR system used, although this is a key evaluation metric. This raises the question of whether the OCR could be specifically designed to favour the proposed generation method.
- The data sets used for the experiments are relatively simple and somewhat artificial, consisting of non-spontaneous writing with isolated words on a white background. A demonstration of the model's generalisability to more realistic, complex use cases would have strengthened the

Reviewer 02·Rating 3·Confidence 4

Strengths

1. The authors focus on handwritten text generation in the wild. The work decomposes text-line content preservation across numerous characters into global context supervision between characters and local supervision of individual character structures.
2. A lot of experiments are conducted to support the proposed method, including two widely used handwritten datasets.
3. The authors consider more baselines, which is effective and reasonable.
4. The paper exhibits some good figures, whic

Weaknesses

1. I think the presentation in this paper is not good. For example: 'It is non-trivial to accurately capture writing styles from text-lines with multiple words, as it involves not only intra-word style patterns like letter connections and slant but also inter-word spacing and vertical alignment'. There are more sentences in this paper that are not readable.
2. It is hard to follow the story. The main idea is not easy to grasp when I try to read both the introduction and the method sections.
3. Auth

Reviewer 03·Rating 3·Confidence 4

Strengths

(1) This paper introduces a diffusion model for generating handwritten text lines that adeptly captures both vertical and horizontal writing styles, using dual-level discriminators to ensure content accuracy. However, the approach encodes handwriting style in terms of intra- and inter-word spacing, which is somewhat unusual. Additionally, the encoding of styles through space with column sampling and the design of the proxy anchor loss are unclear, as noted in the weaknesses section.

Weaknesses

(1) The formulation of \( L_{ver} \) and \( L_{hor} \) as proxy anchor losses is somewhat unclear. These losses appear to assume uniform spacing between words; however, this spacing is often inconsistent for a given writer. For instance, in Figure 2(a) (top row), the space between the first two words differs from that between the last two words. Furthermore, if this focuses solely on word-to-word spacing, how is character-to-character spacing addressed?
(2) The proposed method explicitly focuse

Reviewer 04·Rating 5·Confidence 5

Strengths

The authors evaluate DiffBrush using multiple quantitative metrics, such as "Handwriting Distance (HWD)" for style fidelity, "Character Error Rate (CER)" and "Word Error Rate (WER)" for content accuracy, and image quality metrics (FID, IS).
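For readers unfamiliar with the content-accuracy metrics named above, CER and WER are conventionally defined as the edit distance between a reference transcription and a recognized one, divided by the reference length in characters or words. The sketch below illustrates that standard definition only. It assumes the recognized string has already been produced by some OCR system, and the function names `edit_distance`, `cer`, and `wer` are illustrative, not taken from the paper or its code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical example: ground-truth text vs. what an OCR model read
    # from a generated text-line image.
    truth = "hello world"
    recognized = "hallo world"
    print(f"CER = {cer(truth, recognized):.4f}")
    print(f"WER = {wer(truth, recognized):.4f}")
```

Lower values mean the generated line is easier to read back correctly. Note that both rates can exceed 1.0 when the recognized text is much longer than the reference, and that the result depends on the OCR system used, which is the concern Reviewer 01 raises.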

Weaknesses

1. Not much discussion is available on the interpretability of the learned style space. It is not clear how distinct the learned vertical and horizontal style representations are, or how they vary across writers. Visualizations of the learned style features could enhance understanding and trust in the model’s style-capturing ability.
2. More detailed explanation on how procurement of two style representations to clearly explain how they are different from the method proposed in ONE-DM i

Reviewer 05·Rating 5·Confidence 4

Strengths

Originality: While there have previously been works on the generation of images of handwritten lines, using GANs and diffusion models for generation of images of handwritten words, this is (one of the) first to combine all 3.

Clarity & quality: The writing is clear, there is some amount of ablation studies to highlight the importance of the proposed components (namely the specific approach to style extraction module and the need for two different discriminators). The human preference study is al

Weaknesses

* Measuring the effect of the proposed ideas. The paper proposes the style extraction model with very set of biases, but the effect of it compared to a much simpler model that simply passes information from the style source image is not measured. Furthermore, the effect of the style model on the recognizability of the generated image seems very small as per Figure 6. Having more of such ablations would strengthen the paper.
* Generalization of the model to other data. The IAM and CVL datasets b
