Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models

Yichao Zhou; James B. Wendt; Navneet Potti; Jing Xie; Sandeep Tata

arXiv:2210.16391·cs.CL·November 1, 2022·1 cites

Open Access

TL;DR

This paper introduces Selective Labeling and an active learning strategy to significantly reduce data-labeling costs for visually rich document extraction models, achieving up to tenfold cost savings with minimal accuracy loss.

Contribution

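The paper proposes Selective Labeling, in which annotators give "yes/no" answers on candidate extractions predicted by a model trained on partially labeled documents, and pairs it with a custom active learning strategy that surfaces the predictions the model is most uncertain about.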

Findings

01 Reduced labeling costs by up to 10x

02 Negligible accuracy loss with the new method

03 Effective across multiple document domains

Abstract

A key bottleneck in building automatic extraction models for visually rich documents like invoices is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. We propose Selective Labeling to simplify the labeling task to provide "yes/no" labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by $10 \times$ with a negligible loss in accuracy.
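The labeling loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' custom active learning strategy, which the abstract does not specify; it swaps in plain uncertainty sampling, where the candidates whose confidence score is closest to 0.5 are sent to an annotator for a "yes/no" answer. The `Candidate` type, its `score` field, and the function names are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate extraction predicted by a model (hypothetical structure)."""
    doc_id: str
    field: str
    text: str
    score: float  # assumed model confidence in [0, 1]


def uncertainty(score: float) -> float:
    """Return 1.0 when score is 0.5 and 0.0 when score is 0 or 1."""
    return 1.0 - 2.0 * abs(score - 0.5)


def select_for_review(candidates, budget):
    """Pick the `budget` candidates the model is least sure about.

    Generic uncertainty sampling, used here as a stand-in for the
    paper's custom strategy.
    """
    ranked = sorted(candidates, key=lambda c: uncertainty(c.score), reverse=True)
    return ranked[:budget]


def apply_answer(candidate, is_correct):
    """Turn an annotator's yes/no answer into a binary training label."""
    return {
        "doc_id": candidate.doc_id,
        "field": candidate.field,
        "text": candidate.text,
        "label": bool(is_correct),
    }


candidates = [
    Candidate("inv-1", "total", "$120.00", 0.97),
    Candidate("inv-1", "due_date", "2022-11-01", 0.52),
    Candidate("inv-2", "total", "$48.10", 0.10),
    Candidate("inv-2", "due_date", "2022-12-15", 0.45),
]

to_review = select_for_review(candidates, budget=2)
labels = [apply_answer(c, is_correct=True) for c in to_review]
```

Under these assumptions, the annotator only confirms or rejects a few individual candidates instead of labeling every field in every document, which is the source of the cost reduction the abstract claims.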

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Image Retrieval and Classification Techniques