Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models
Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, Sandeep Tata

TL;DR
This paper introduces Selective Labeling and an active learning strategy to significantly reduce data-labeling costs for visually rich document extraction models, achieving up to tenfold cost savings with minimal accuracy loss.
Contribution
The paper presents a novel selective labeling approach combined with active learning to drastically lower labeling costs for document extraction models.
Findings
Reduced labeling costs by up to 10x
Negligible accuracy loss with the new method
Effective across multiple document domains
Abstract
A key bottleneck in building automatic extraction models for visually rich documents like invoices is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. We propose Selective Labeling to simplify the labeling task to provide "yes/no" labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.
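The abstract does not spell out the custom active learning strategy, so the following is only a minimal sketch of the general idea, with plain uncertainty sampling swapped in for the authors' method. It assumes each candidate extraction carries a model score in [0, 1] and sends the candidates whose scores are closest to 0.5 to an annotator as "yes/no" questions. The `Candidate` class, the `select_for_review` helper, the field names, and the sample values are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate extraction proposed by the model."""
    doc_id: str
    field: str
    text: str
    score: float  # assumed: model's estimated probability the candidate is correct


def uncertainty(score: float) -> float:
    """Return a value in [0, 1] that peaks when the score is exactly 0.5."""
    return 1.0 - abs(2.0 * score - 1.0)


def select_for_review(candidates, budget):
    """Return up to `budget` candidates, most uncertain first."""
    ranked = sorted(candidates, key=lambda c: uncertainty(c.score), reverse=True)
    return ranked[:budget]


# Illustrative data only; not taken from the paper.
candidates = [
    Candidate("inv-001", "total_due", "$1,240.00", 0.97),
    Candidate("inv-001", "total_due", "$1,040.00", 0.52),
    Candidate("inv-002", "invoice_date", "2021-03-04", 0.45),
    Candidate("inv-003", "invoice_date", "2021-04-04", 0.08),
]

to_review = select_for_review(candidates, budget=2)
for c in to_review:
    # Each selected candidate becomes a single yes/no question for the annotator.
    print(f"Is '{c.text}' the {c.field} of {c.doc_id}? (yes/no)")
```

In this sketch the annotator never draws boxes or labels a full document; they only confirm or reject the few candidates the model is least sure about, which is where the cost reduction described in the abstract would come from.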
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Image Retrieval and Classification Techniques
