DocXPand-25k: a large and diverse benchmark dataset for identity documents analysis
Julien Lerouge, Guillaume Betmont, Thomas Bres, Evgeny Stepankevich, Alexis Bergès

TL;DR
The paper introduces DocXPand-25k, a large synthetic dataset of nearly 25,000 diverse, richly labeled identity document images designed to benchmark ID analysis methods amid privacy constraints.
Contribution
It provides a publicly available, synthetic dataset with diverse ID types and backgrounds, addressing the lack of large, annotated datasets for ID analysis research.
Findings
Rich diversity in ID layouts and contents
Synthetic images with real-world background variability
Public release of dataset and generation software
Abstract
Identity document (ID) image analysis has become essential for many online services, like bank account opening or insurance subscription. In recent years, much research has been conducted on subjects like document localization, text recognition and fraud detection, to achieve a level of accuracy reliable enough to automate identity verification. However, there are only a few available datasets to benchmark ID analysis methods, mainly because of privacy restrictions, security requirements and legal reasons. In this paper, we present the DocXPand-25k dataset, which consists of 24,994 richly labeled ID images, generated using custom-made vectorial templates representing nine fictitious ID designs, including four identity card designs, two residence permit designs and three passport designs. These synthetic IDs feature artificially generated personal information (names, dates, identifiers, faces,…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
