DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion

Marcel Lamott; Saifullah Saifullah; Nauman Riaz; Yves-Noel Weweler; Tobias Alt-Veit; Ahmad Sarmad Ali; Muhammad Armaghan Shakir; Adrian Kalwa; Momina Moetesum; Andreas Dengel; Sheraz Ahmed; Faisal Shafait; Ulrich Schwanecke; Adrian Ulges

arXiv:2602.21824 · cs.LG · February 26, 2026

TL;DR

DocDjinn introduces a controllable synthetic document generation framework using Vision-Language Models and handwriting diffusion, producing high-quality, annotated documents that effectively augment real datasets for various document understanding tasks.

Contribution

This work is the first to demonstrate VLMs generating faithful, annotated synthetic documents at scale from unlabeled seeds, improving data efficiency for document intelligence models.

Findings

01. Achieves 87% of real dataset performance with only 100 real samples.
02. Generates diverse, high-quality synthetic documents with realistic handwriting.
03. Effective across multiple benchmarks including information extraction and document classification.

Abstract

Effective document intelligence models rely on large amounts of annotated training data. However, procuring sufficient and high-quality data poses significant challenges due to the labor-intensive and costly nature of data acquisition. Additionally, leveraging language models to annotate real documents raises concerns about data privacy. Synthetic document generation has emerged as a promising, privacy-preserving alternative. We propose DocDjinn, a novel framework for controllable synthetic document generation using Vision-Language Models (VLMs) that produces annotated documents from unlabeled seed samples. Our approach generates visually plausible and semantically consistent synthetic documents that follow the distribution of an existing source dataset through clustering-based seed selection with parametrized sampling. By enriching documents with realistic diffusion-based handwriting…
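
The clustering-based seed selection with parametrized sampling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm, embedding, or sampling parameters DocDjinn uses, so everything here is an assumption made for illustration: plain k-means, toy 2-D embeddings, and the hypothetical names `kmeans`, `sample_seeds`, and `temperature`. The idea shown is only that seeds are grouped by similarity and then drawn so that the sample either follows the source distribution or is flattened toward uniform coverage of the clusters.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def sample_seeds(labels, n_seeds, temperature=1.0, seed=0):
    """Pick a cluster with weight size**temperature, then a member uniformly.

    temperature is a hypothetical knob: 1.0 follows the source cluster
    sizes, 0.0 treats every cluster equally.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    ids = sorted(clusters)
    weights = [len(clusters[c]) ** temperature for c in ids]
    chosen = rng.choices(ids, weights=weights, k=n_seeds)
    return [rng.choice(clusters[c]) for c in chosen]


# Toy stand-in for document embeddings: two well-separated groups.
embeddings = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans(embeddings, k=2)
seeds = sample_seeds(labels, n_seeds=4)
print(labels, seeds)
```

In this sketch, the indices returned by `sample_seeds` would identify the unlabeled seed documents handed to the generator. Lowering `temperature` toward 0 trades fidelity to the source distribution for more even coverage of rare layouts.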

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques