DocXPand-25k: a large and diverse benchmark dataset for identity documents analysis

Julien Lerouge; Guillaume Betmont; Thomas Bres; Evgeny Stepankevich; Alexis Bergès

arXiv:2407.20662·cs.CV·July 31, 2024·2 cites

Open Access 1 Repo

TL;DR

The paper introduces DocXPand-25k, a large synthetic dataset of nearly 25,000 diverse, richly labeled identity document images designed to benchmark ID analysis methods amid privacy constraints.

Contribution

It provides a publicly available, synthetic dataset with diverse ID types and backgrounds, addressing the lack of large, annotated datasets for ID analysis research.

Findings

01 Rich diversity in ID layouts and contents

02 Synthetic images with real-world background variability

03 Public release of dataset and generation software

Abstract

Identity document (ID) image analysis has become essential for many online services, such as opening a bank account or subscribing to insurance. In recent years, much research has been conducted on subjects like document localization, text recognition and fraud detection, to achieve a level of accuracy reliable enough to automate identity verification. However, only a few datasets are available to benchmark ID analysis methods, mainly because of privacy restrictions, security requirements and legal reasons. In this paper, we present the DocXPand-25k dataset, which consists of 24,994 richly labeled ID images, generated using custom-made vectorial templates representing nine fictitious ID designs, including four identity card, two residence permit and three passport designs. These synthetic IDs feature artificially generated personal information (names, dates, identifiers, faces,…

Code & Models

Repositories

quicksign/docxpand (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods