Field Extraction from Forms with Unlabeled Data
Mingfei Gao, Zeyuan Chen, Nikhil Naik, Kazuma Hashimoto, Caiming Xiong, Ran Xu

TL;DR
This paper introduces a new framework for extracting form fields from unlabeled data by leveraging pseudo-labels and a transformer model, improving accuracy through a progressive refinement process.
Contribution
It presents a novel combination of rule-based pseudo-label mining and a transformer-based extraction model with a refinement module for unlabeled form data.
Findings
Effective field extraction demonstrated on unlabeled forms
Pseudo-label refinement improves model robustness
Transformer-based representations enhance extraction accuracy
Abstract
We propose a novel framework to conduct field extraction from forms with unlabeled data. To bootstrap the training process, we develop a rule-based method for mining noisy pseudo-labels from unlabeled forms. Using the supervisory signal from the pseudo-labels, we extract a discriminative token representation from a transformer-based model by modeling the interaction between text in the form. To prevent the model from overfitting to label noise, we introduce a refinement module based on a progressive pseudo-label ensemble. Experimental results demonstrate the effectiveness of our framework.
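The abstract does not spell out the mining rules, so the following is only a minimal sketch of what rule-based pseudo-label mining could look like, not the authors' method. It assumes each field has a few key phrases and a value pattern, and picks the pattern-matching token closest to a key. The `Token` class, the `FIELD_RULES` table, the field names, and the nearest-neighbour heuristic are all hypothetical. The transformer model and the progressive pseudo-label ensemble are not covered here.

```python
import re
from dataclasses import dataclass
from math import hypot
from typing import Dict, List, Optional


@dataclass
class Token:
    """An OCR text span with the (x, y) position of its box centre."""
    text: str
    x: float
    y: float


# Hypothetical rules: key phrases that localise a field, and a regex
# describing what a plausible value for that field looks like.
FIELD_RULES = {
    "invoice_number": {
        "keys": ["invoice no", "invoice number"],
        "pattern": r"^[A-Z0-9-]{4,}$",
    },
    "total": {
        "keys": ["total"],
        "pattern": r"^\$?\d+(\.\d{2})?$",
    },
}


def mine_pseudo_labels(
    tokens: List[Token], rules: Dict[str, dict] = FIELD_RULES
) -> Dict[str, Optional[str]]:
    """Assign each field the pattern-matching token nearest to one of its keys.

    The result is a noisy pseudo-label: a field maps to None when no key
    or no candidate value is found on the form.
    """
    labels: Dict[str, Optional[str]] = {}
    for field, rule in rules.items():
        # Tokens whose normalised text equals one of the key phrases.
        keys = [t for t in tokens
                if t.text.lower().rstrip(":").strip() in rule["keys"]]
        # Candidate values: non-key tokens that match the value pattern.
        values = [t for t in tokens
                  if t not in keys and re.match(rule["pattern"], t.text)]
        if not keys or not values:
            labels[field] = None
            continue
        # Pick the candidate with the smallest distance to any key.
        best = min(
            values,
            key=lambda v: min(hypot(v.x - k.x, v.y - k.y) for k in keys),
        )
        labels[field] = best.text
    return labels


# Toy form: two key/value pairs plus a distractor that matches both patterns.
form = [
    Token("Invoice No:", 10, 10),
    Token("INV-1042", 120, 10),
    Token("2021", 300, 10),
    Token("Total", 10, 200),
    Token("$58.00", 120, 200),
]
pseudo_labels = mine_pseudo_labels(form)
```

Labels mined this way are expected to be wrong on some forms (for example, when a distractor sits closer to the key than the true value), which is the label noise the refinement module in the abstract is meant to counter.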
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Music and Audio Processing
