Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation
Aniket Bhattacharyya, Anurag Tripathi

TL;DR
This paper introduces TAIL, a method for generating synthetic labels for heterogeneous visually rich documents. TAIL enables effective training of document understanding models without ground truth labels, and the paper demonstrates its efficiency and accuracy on real-world expense document extraction.
Contribution
The paper presents TAIL, a novel synthetic label generation technique combined with knowledge distillation, to train multimodal document understanding models without relying on labeled datasets.
Findings
The TAIL-trained model achieves performance comparable to Claude 3 Sonnet on benchmark datasets.
It outperforms state-of-the-art layout-aware models by over 10% in ANLS scores.
It is 85% less costly and approximately 5 times faster than large multimodal models.
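The findings above are reported in ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of what that metric measures, here is a minimal sketch of its commonly used definition. The paper's exact evaluation protocol is not given in this summary, so the 0.5 threshold, the lowercasing and whitespace stripping, and the function names `levenshtein` and `anls` are all assumptions for illustration only.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over samples of the best similarity against any reference.

    Similarity is 1 - normalized edit distance, zeroed out when the
    normalized distance reaches the threshold (assumed 0.5 here).
    """
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical extracted fields versus reference answers.
    preds = ["acme corp", "42.10", "xyz"]
    refs = [["Acme Corp"], ["42.18"], ["abc"]]
    print(round(anls(preds, refs), 3))
```

The thresholding means a near miss such as a single wrong digit still earns partial credit, while an unrelated answer scores zero, which is why the metric is popular for extraction output that contains OCR-style errors.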
Abstract
Invoices and receipts submitted by employees are visually rich documents (VRDs) with textual, visual and layout information. To protect against the risk of fraud and abuse, it is crucial for organizations to efficiently extract desired information from submitted receipts. This helps in the assessment of key factors such as appropriateness of the expense claim, adherence to spending and transaction policies, the validity of the receipt, as well as downstream anomaly detection at various levels. These documents are heterogeneous, with multiple formats and languages, uploaded with different image qualities, and often do not contain ground truth labels for the efficient training of models. In this paper we propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in VRD corpuses without labels, and fine-tune a multimodal Visually Rich Document…
Taxonomy
Topics: Web Applications and Data Management · Advanced Computational Techniques and Applications · Web Data Mining and Analysis
Methods: Knowledge Distillation · Attentive Walk-Aggregating Graph Neural Network
