WriteViT: Handwritten Text Generation with Vision Transformer

Dang Hoai Nam; Huynh Tong Dang Khoa; Vo Nguyen Le Duy

arXiv:2505.13235·cs.CV·May 20, 2025

TL;DR

WriteViT is a Vision Transformer-based framework for one-shot handwritten text synthesis that captures both writing style and content, with a focus on low-resource multilingual settings such as Vietnamese and English.

Contribution

WriteViT introduces a transformer-based architecture for handwriting generation that integrates style extraction, multi-scale generation, and recognition, using transformers in key components where earlier methods typically rely on CNNs or CRNNs.

Findings

1. Produces high-quality, style-consistent handwriting
2. Effective in low-resource multilingual scenarios
3. Maintains strong recognition performance

Abstract

Humans can quickly generalize handwriting styles from a single example by intuitively separating content from style. Machines, however, struggle with this task, especially in low-data settings, often missing subtle spatial and stylistic cues. Motivated by this gap, we introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT), a family of models that have shown strong performance across various computer vision tasks. WriteViT integrates a ViT-based Writer Identifier for extracting style embeddings, a multi-scale generator built with Transformer encoder-decoder blocks enhanced by conditional positional encoding (CPE), and a lightweight ViT-based recognizer. While previous methods typically rely on CNNs or CRNNs, our design leverages transformers in key components to better capture both fine-grained stroke details and higher-level style…
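
The conditional positional encoding (CPE) mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: it follows the general idea of a positional encoding generator, in which tokens are laid out on their 2D grid, passed through a zero-padded depthwise convolution, and the result is added back to the tokens. The function names, the 3x3 kernel size, and the pure-Python list layout are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Kernel = List[List[float]]  # one 3x3 kernel per channel (assumed size)


def depthwise_conv3x3(grid: List[List[Vector]], kernels: List[Kernel]) -> List[List[Vector]]:
    """Zero-padded 3x3 depthwise convolution: channel ch only sees channel ch."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ch in range(c):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        # Positions outside the grid count as zeros (zero padding).
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernels[ch][dy + 1][dx + 1] * grid[yy][xx][ch]
                out[y][x][ch] = acc
    return out


def conditional_positional_encoding(
    tokens: List[Vector], height: int, width: int, kernels: List[Kernel]
) -> List[Vector]:
    """Return tokens plus a position signal computed from their 2D neighbourhood."""
    assert len(tokens) == height * width, "tokens must fill the height x width grid"
    # Reshape the flat token sequence back onto its 2D grid (row-major order).
    grid = [tokens[r * width:(r + 1) * width] for r in range(height)]
    pos = depthwise_conv3x3(grid, kernels)
    # Flatten again, adding the positional signal to each token.
    return [
        [t + p for t, p in zip(grid[r][col], pos[r][col])]
        for r in range(height)
        for col in range(width)
    ]


# Hypothetical usage: identical one-channel tokens on a 3x4 grid.
ones_kernel = [[[1.0] * 3 for _ in range(3)]]
encoded = conditional_positional_encoding([[1.0] for _ in range(12)], 3, 4, ones_kernel)
```

Because the encoding is computed from each token's neighbourhood instead of being looked up in a fixed-length table, the same kernels apply to grids of any width, which suits text images of varying length. In the usage above, zero padding makes border tokens receive a different signal from interior tokens even though all inputs are identical, which is how position information enters.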

Code & Models

Repositories

hnam-1765/writevit (PyTorch, official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Topic Modeling · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Depthwise Convolution · Positional Encoding Generator · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Byte Pair Encoding · Conditional Positional Encoding · Label Smoothing