WriteViT: Handwritten Text Generation with Vision Transformer
Dang Hoai Nam, Huynh Tong Dang Khoa, Vo Nguyen Le Duy

TL;DR
WriteViT is a transformer-based framework for one-shot handwritten text synthesis that captures both style and content, particularly in low-resource multilingual settings such as Vietnamese and English.
Contribution
It introduces a transformer-based architecture for handwriting generation, integrating style extraction, multi-scale generation, and recognition, advancing beyond CNN-based methods.
Findings
Produces high-quality, style-consistent handwriting
Effective in low-resource multilingual scenarios
Maintains strong recognition performance
Abstract
Humans can quickly generalize handwriting styles from a single example by intuitively separating content from style. Machines, however, struggle with this task, especially in low-data settings, often missing subtle spatial and stylistic cues. Motivated by this gap, we introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT), a family of models that have shown strong performance across various computer vision tasks. WriteViT integrates a ViT-based Writer Identifier for extracting style embeddings, a multi-scale generator built with Transformer encoder-decoder blocks enhanced by conditional positional encoding (CPE), and a lightweight ViT-based recognizer. While previous methods typically rely on CNNs or CRNNs, our design leverages transformers in key components to better capture both fine-grained stroke details and higher-level style…
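The abstract names conditional positional encoding (CPE) as part of the generator but does not show how it is computed. As a rough illustration only, CPE is commonly realized by reshaping the token sequence back into its 2D grid, running a zero-padded depthwise convolution over it, and adding the result to the tokens. The sketch below follows that general recipe in plain Python; it is not WriteViT's implementation, and the function names, the 3x3 kernel size, and the toy values are all assumptions made for the example.

```python
from typing import List

Vector = List[float]


def depthwise_conv3x3(grid: List[List[Vector]],
                      kernels: List[List[List[float]]]) -> List[List[Vector]]:
    """Zero-padded depthwise 3x3 convolution over an [H][W][C] grid.

    kernels[ch] is the 3x3 filter for channel ch; channels are never mixed.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ch in range(c):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        # Positions outside the grid count as zeros.
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernels[ch][dy + 1][dx + 1] * grid[yy][xx][ch]
                out[y][x][ch] = acc
    return out


def conditional_positional_encoding(tokens: List[Vector], h: int, w: int,
                                    kernels: List[List[List[float]]]) -> List[Vector]:
    """Return tokens + depthwise_conv(tokens arranged as an h x w grid).

    Hypothetical helper: the encoding depends on each token's neighbours,
    so it adapts to the input size instead of using a fixed lookup table.
    """
    if len(tokens) != h * w:
        raise ValueError("expected h * w tokens")
    grid = [tokens[r * w:(r + 1) * w] for r in range(h)]
    pos = depthwise_conv3x3(grid, kernels)
    return [
        [t + p for t, p in zip(tokens[r * w + col], pos[r][col])]
        for r in range(h)
        for col in range(w)
    ]


# Toy example: six identical 1-channel tokens on a 2 x 3 grid, all-ones kernel.
tokens = [[1.0] for _ in range(6)]
ones_kernel = [[[1.0] * 3 for _ in range(3)]]
encoded = conditional_positional_encoding(tokens, 2, 3, ones_kernel)
print(encoded)  # [[5.0], [7.0], [5.0], [5.0], [7.0], [5.0]]
```

In the toy run every input token is identical, yet the middle column receives a different value from the edge columns, because the zero padding at the border changes what each position's neighbourhood sums to. In a real model the kernels would be learned parameters rather than fixed ones.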
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Depthwise Convolution · Positional Encoding Generator · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Byte Pair Encoding · Conditional Positional Encoding · Label Smoothing
