QUEST: Quality-aware Semi-supervised Table Extraction for Business Documents

Eliott Thomas; Mickael Coustaty; Aurelie Joseph; Gaspar Deloin; Elodie Carel; Vincent Poulain D'Andecy; Jean-Marc Ogier

arXiv:2506.14568 · cs.AI · June 24, 2025

TL;DR

QUEST is a semi-supervised table extraction framework for business documents. It selects pseudo-labels with a quality assessment model trained to predict F1 scores, rather than with confidence scores, so that unlabeled data improves accuracy and reduces extraction errors.

Contribution

It introduces a quality-aware pseudo-labeling approach with a new model predicting F1 scores, enhancing semi-supervised table extraction for business documents.
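The idea of quality-aware pseudo-labeling can be illustrated with a minimal sketch. It assumes a quality model has already assigned each extracted table a predicted F1 score; the `Candidate` structure, the `select_pseudo_labels` helper, and the 0.9 threshold are hypothetical choices made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A table extracted from an unlabeled document (hypothetical structure)."""
    doc_id: str
    predicted_f1: float  # quality model output, assumed to lie in [0, 1]
    confidence: float    # extractor's own confidence, shown only for contrast


def select_pseudo_labels(candidates, min_f1=0.9):
    """Keep candidates whose predicted F1 reaches the threshold, best first.

    The threshold value is an assumption for this sketch.
    """
    kept = [c for c in candidates if c.predicted_f1 >= min_f1]
    return sorted(kept, key=lambda c: c.predicted_f1, reverse=True)


candidates = [
    Candidate("doc_a", predicted_f1=0.95, confidence=0.60),
    Candidate("doc_b", predicted_f1=0.40, confidence=0.99),
    Candidate("doc_c", predicted_f1=0.92, confidence=0.85),
]

selected = select_pseudo_labels(candidates, min_f1=0.9)
print([c.doc_id for c in selected])  # ['doc_a', 'doc_c']
```

In this toy data, `doc_b` has the highest extractor confidence but a low predicted F1, so a confidence-based filter would keep it while the quality-based filter drops it.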

Findings

01. F1 score improved from 64% to 74% on proprietary data.

02. Reduces empty predictions by 45%.

03. Achieves 50% F1 on DocILE benchmark, up from 42%.

Abstract

Automating table extraction (TE) from business documents is critical for industrial workflows but remains challenging due to sparse annotations and error-prone multi-stage pipelines. While semi-supervised learning (SSL) can leverage unlabeled data, existing methods rely on confidence scores that poorly reflect extraction quality. We propose QUEST, a Quality-aware Semi-supervised Table extraction framework designed for business documents. QUEST introduces a novel quality assessment model that evaluates structural and contextual features of extracted tables, trained to predict F1 scores instead of relying on confidence metrics. This quality-aware approach guides pseudo-label selection during iterative SSL training, while diversity measures (DPP, Vendi score, IntDiv) mitigate confirmation bias. Experiments on a proprietary business dataset (1000 annotated + 10000 unannotated documents)…
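To make the role of a diversity measure concrete, the sketch below computes one common form of internal diversity (IntDiv): one minus the mean pairwise similarity within a selected set. The use of cosine similarity, the feature vectors, and the function names are assumptions for illustration; the abstract does not specify how QUEST represents documents or computes its diversity scores.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def internal_diversity(features):
    """One minus the mean pairwise cosine similarity over distinct pairs.

    Returns 0.0 for sets with fewer than two items (a convention of this sketch).
    """
    pairs = list(combinations(features, 2))
    if not pairs:
        return 0.0
    mean_similarity = sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Hypothetical document feature vectors for two candidate pseudo-label batches.
redundant_batch = [[1.0, 2.0], [1.0, 2.0], [2.0, 4.0]]
varied_batch = [[1.0, 0.0], [0.0, 1.0]]

print(internal_diversity(redundant_batch))  # close to 0: near-identical items
print(internal_diversity(varied_batch))     # close to 1: orthogonal items
```

A batch of near-duplicate documents scores close to zero, which is the situation a diversity criterion is meant to flag when guarding against confirmation bias during iterative training.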

Code & Models

Repositories

eliottthomas99/data_quest (Official)
