Improving Information Extraction on Business Documents with Specific Pre-Training Tasks

Thibault Douzon; Stefan Duffner; Christophe Garcia; Jérémy Espinas

arXiv:2309.05429·cs.CL·September 12, 2023

TL;DR

This paper enhances transformer-based models for business document information extraction by introducing two specialized pre-training tasks focused on layout understanding and numeric values, leading to improved extraction accuracy.

Contribution

It proposes two novel pre-training tasks for LayoutLM that better capture document layout and numeric information, and introduces a new decoding algorithm for complex entities.

Findings

01. Significant F1 score improvements on public datasets (from 93.88 to 95.50)

02. Moderate F1 score improvements on private datasets (from 84.35 to 84.84)

03. Enhanced understanding of complex document structures

Abstract

Transformer-based Language Models are widely used in Natural Language Processing related tasks. Thanks to their pre-training, they have been successfully adapted to Information Extraction in business documents. However, most pre-training tasks proposed in the literature for business documents are too generic and not sufficient to learn more complex structures. In this paper, we use LayoutLM, a language model pre-trained on a collection of business documents, and introduce two new pre-training tasks that further improve its capacity to extract relevant information. The first is aimed at better understanding the complex layout of documents, and the second focuses on numeric values and their order of magnitude. These tasks force the model to learn better-contextualized representations of the scanned documents. We further introduce a new post-processing algorithm to decode BIESO tags in…
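For readers unfamiliar with the BIESO scheme mentioned in the abstract (Begin, Inside, End, Single, Outside), the sketch below shows a naive baseline decoder that turns a tag sequence into entity spans. It is not the post-processing algorithm proposed in the paper, whose details are not given here; the function name `decode_bieso`, the `PREFIX-LABEL` tag format, and the example labels are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (label, start index, end index inclusive); a hypothetical span format
Span = Tuple[str, int, int]


def decode_bieso(tags: List[str]) -> List[Span]:
    """Naive BIESO decoding: keep only well-formed B (I)* E runs and S tags.

    Assumes tags look like "B-TOTAL", "I-TOTAL", "E-TOTAL", "S-DATE" or "O".
    Ill-formed runs (for example an I tag with no preceding B) are dropped.
    """
    spans: List[Span] = []
    start: Optional[int] = None   # index where the currently open span began
    label: Optional[str] = None   # label of the currently open span

    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "S":
            # Single-token entity; any open span is abandoned.
            spans.append((name, i, i))
            start, label = None, None
        elif prefix == "B":
            # Open a new span (silently discarding an unfinished one).
            start, label = i, name
        elif prefix == "I" and start is not None and name == label:
            # Continue the open span.
            continue
        elif prefix == "E" and start is not None and name == label:
            # Close the open span.
            spans.append((name, start, i))
            start, label = None, None
        else:
            # "O" or an inconsistent tag: drop any open span.
            start, label = None, None
    return spans


example = ["O", "B-TOTAL", "I-TOTAL", "E-TOTAL", "S-DATE", "I-DATE", "B-NAME", "O"]
print(decode_bieso(example))
```

A strict decoder like this discards every entity whose tag sequence is even slightly inconsistent, which is the kind of loss a more tolerant post-processing step would aim to reduce.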

Code & Models

Repositories

thibaultdouzon/business-document-pre-training
PyTorch · Official