TL;DR
QUEST is a semi-supervised table extraction framework for business documents. Instead of trusting confidence scores, it uses a quality assessment model to decide which pseudo-labels from unlabeled data are good enough to train on, which raises F1 and reduces empty predictions.
Contribution
QUEST introduces quality-aware pseudo-labeling: a quality assessment model, trained to predict the F1 score of an extracted table from its structural and contextual features, guides pseudo-label selection during iterative semi-supervised training.
Findings
Improves F1 from 64% to 74% on the proprietary business dataset.
Reduces empty predictions by 45%.
Reaches 50% F1 on the DocILE benchmark, up from 42%.
Abstract
Automating table extraction (TE) from business documents is critical for industrial workflows but remains challenging due to sparse annotations and error-prone multi-stage pipelines. While semi-supervised learning (SSL) can leverage unlabeled data, existing methods rely on confidence scores that poorly reflect extraction quality. We propose QUEST, a Quality-aware Semi-supervised Table extraction framework designed for business documents. QUEST introduces a novel quality assessment model that evaluates structural and contextual features of extracted tables, trained to predict F1 scores instead of relying on confidence metrics. This quality-aware approach guides pseudo-label selection during iterative SSL training, while diversity measures (DPP, Vendi score, IntDiv) mitigate confirmation bias. Experiments on a proprietary business dataset (1000 annotated + 10000 unannotated documents)…
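The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` fields, the thresholds, and the greedy cosine-similarity filter are all assumptions made for illustration, and the greedy filter is a simple stand-in for the diversity measures the paper names (DPP, Vendi score, IntDiv). The idea shown is only the general pattern: keep extractions whose predicted F1 is high, then skip near-duplicates so the pseudo-label set does not reinforce a single document layout.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class Candidate:
    """One extracted table from an unlabeled document (hypothetical structure)."""
    doc_id: str
    predicted_f1: float      # output of an assumed quality assessment model
    features: List[float]    # assumed feature vector describing the extraction


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_pseudo_labels(
    candidates: List[Candidate],
    f1_threshold: float = 0.9,     # assumed cut-off on predicted quality
    max_similarity: float = 0.95,  # assumed cut-off for "near-duplicate"
    budget: int = 100,             # assumed cap on pseudo-labels per iteration
) -> List[Candidate]:
    """Keep high-quality candidates, greedily skipping near-duplicates."""
    # Quality filter: rank by predicted F1, not by model confidence.
    ranked = sorted(
        (c for c in candidates if c.predicted_f1 >= f1_threshold),
        key=lambda c: c.predicted_f1,
        reverse=True,
    )
    selected: List[Candidate] = []
    for cand in ranked:
        if len(selected) >= budget:
            break
        # Diversity filter: accept only if not too similar to anything kept so far.
        if all(cosine(cand.features, s.features) <= max_similarity for s in selected):
            selected.append(cand)
    return selected


if __name__ == "__main__":
    pool = [
        Candidate("a", 0.95, [1.0, 0.0]),
        Candidate("b", 0.93, [1.0, 0.01]),  # near-duplicate of "a"
        Candidate("c", 0.91, [0.0, 1.0]),
        Candidate("d", 0.50, [0.5, 0.5]),   # predicted quality too low
    ]
    print([c.doc_id for c in select_pseudo_labels(pool)])
```

In an iterative setup, the selected candidates would be added to the training set, the extraction model retrained, and the selection repeated on the remaining unlabeled documents.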
