End-to-end information extraction in handwritten documents: Understanding Paris marriage records from 1880 to 1940

Thomas Constum; Lucas Preel; Théo Larcher; Pierrick Tranouez; Thierry Paquet; Sandra Brée

arXiv:2404.19329·cs.CV·May 1, 2024

TL;DR

This paper introduces an end-to-end deep learning architecture for extracting detailed information from handwritten Paris marriage records spanning 1880-1940, achieving state-of-the-art results and providing a new dataset for research.

Contribution

The paper presents a novel end-to-end model for joint handwritten text recognition and information extraction, along with a new annotated dataset for full-page documents.

Findings

01. Achieved state-of-the-art full-page information extraction results on Esposalles.

02. Demonstrated the effectiveness of different encoding strategies for named entity recognition.

03. Provided a publicly available annotated dataset for handwritten document analysis.

Abstract

The EXO-POPP project aims to establish a comprehensive database comprising 300,000 marriage records from Paris and its suburbs, spanning the years 1880 to 1940, which are preserved in over 130,000 scans of double pages. Each marriage record may encompass up to 118 distinct types of information that require extraction from plain text. In this paper, we introduce the M-POPP dataset, a subset of the M-POPP database with annotations for full-page text recognition and information extraction in both handwritten and printed documents, and which is now publicly available. We present a fully end-to-end architecture adapted from the DAN, designed to perform both handwritten text recognition and information extraction directly from page images without the need for explicit segmentation. We showcase the information extraction capabilities of this architecture by achieving a new state of the art for…

Code & Models

Datasets

thomas-C/m-popp
dataset · 56 downloads
