Beyond Human Annotation: Recent Advances in Data Generation Methods for Document Intelligence
Dehao Ying, Fengchang Yu, Haihua Chen, Changjiang Jiang, Yurong Li, and Wei Lu

TL;DR
This survey gives a comprehensive overview of recent data generation methods for Document Intelligence (DI) and proposes a unified taxonomy and a multi-level evaluation framework for them.
Contribution
It introduces a novel taxonomy and evaluation framework for data generation in DI, unifying diverse methodologies and highlighting key challenges and future directions.
Findings
Organized data generation methods into four resource-centric paradigms.
Established a multi-level evaluation framework for assessing data quality and utility.
Identified critical challenges such as fidelity gaps, and future directions such as co-evolutionary ecosystems.
Abstract
The advancement of Document Intelligence (DI) demands large-scale, high-quality training data, yet manual annotation remains a critical bottleneck. While data generation methods are evolving rapidly, existing surveys are constrained by fragmented focuses on single modalities or specific tasks, lacking a unified perspective aligned with real-world workflows. To fill this gap, this survey establishes the first comprehensive technical map for data generation in DI. Data generation is redefined as supervisory signal production, and a novel taxonomy is introduced based on the "availability of data and labels." This framework organizes methodologies into four resource-centric paradigms: Data Augmentation, Data Generation from Scratch, Automated Data Annotation, and Self-Supervised Signal Construction. Furthermore, a multi-level evaluation framework is established to integrate intrinsic…
Taxonomy
Topics: Topic Modeling · Data Quality and Management · Natural Language Processing Techniques
