DocSpiral: A Platform for Integrated Assistive Document Annotation through Human-in-the-Spiral
Qiang Sun, Sirui Li, Tingting Bi, Du Huynh, Mark Reynolds, Yuanyi Luo, Wei Liu

TL;DR
DocSpiral is a human-in-the-loop platform that streamlines the annotation of image-based documents, reducing manual effort and supporting model training for structured data extraction in specialized fields.
Contribution
DocSpiral introduces a spiral, iterative annotation framework that integrates document format normalization, annotation interfaces, an evaluation metrics dashboard, and API endpoints into a unified workflow for document processing.
Findings
Reduces annotation time by at least 41%
Achieves consistent performance improvements across iterations
Facilitates AI/ML development in domain-specific document fields
Abstract
Acquiring structured data from domain-specific, image-based documents such as scanned reports is crucial for many downstream tasks but remains challenging due to document variability. Many of these documents exist as images rather than as machine-readable text, which requires human annotation to train automated extraction systems. We present DocSpiral, the first Human-in-the-Spiral assistive document annotation platform, designed to address the challenge of extracting structured information from domain-specific, image-based document collections. Our spiral design establishes an iterative cycle in which human annotations train models that progressively require less manual intervention. DocSpiral integrates document format normalization, comprehensive annotation interfaces, an evaluation metrics dashboard, and API endpoints for the development of AI/ML models into a unified workflow.…
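The iterative cycle described in the abstract can be illustrated with a small, purely hypothetical simulation. This is a sketch under stated assumptions, not DocSpiral's implementation: the class `MajorityLabelModel`, the function `run_spiral`, the field labels, and the toy data are all invented for illustration, and the "human" is simulated by a lookup into a gold-label dictionary. It only shows the general shape of the loop: the model pre-annotates, a human corrects, and the corrected data retrains the model.

```python
from collections import Counter, defaultdict


class MajorityLabelModel:
    """Toy stand-in for an extraction model (hypothetical, not DocSpiral's).

    It predicts the label it has most often seen for a given token.
    """

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, examples):
        # examples: iterable of (token, label) pairs confirmed by a human
        for token, label in examples:
            self.counts[token][label] += 1

    def predict(self, token):
        seen = self.counts.get(token)  # .get avoids creating empty entries
        return seen.most_common(1)[0][0] if seen else None


def run_spiral(batches, gold):
    """Run one annotate -> correct -> retrain cycle per batch.

    Returns the number of manual corrections needed in each round.
    """
    model = MajorityLabelModel()
    corrections_per_round = []
    for batch in batches:
        corrections = 0
        reviewed = []
        for token in batch:
            suggestion = model.predict(token)  # model pre-annotation
            truth = gold[token]                # simulated human review
            if suggestion != truth:
                corrections += 1               # human had to fix this one
            reviewed.append((token, truth))
        model.train(reviewed)                  # feed corrections back
        corrections_per_round.append(corrections)
    return corrections_per_round


# Hypothetical toy data: tokens and the field labels a human would assign.
gold = {"INV-001": "invoice_id", "2021-03-04": "date", "ACME": "vendor"}
batches = [
    ["INV-001", "ACME"],
    ["INV-001", "2021-03-04", "ACME"],
    ["ACME", "2021-03-04", "INV-001"],
]
print(run_spiral(batches, gold))  # [2, 1, 0]
```

In this toy run the manual corrections fall from round to round because each round's reviewed annotations become training data for the next, which is the behaviour the spiral design aims for. A real system would replace the lookup model with a trained extractor and the gold dictionary with an annotation interface.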
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
