DDI-100: Dataset for Text Detection and Recognition
Ilia Zharikov, Filipp Nikitin, Ilia Vasiliev, Vladimir Dokholyan (Moscow Institute of Physics and Technology)

TL;DR
The paper introduces DDI-100, a large synthetic dataset with over 100,000 images for improving text detection and recognition in document analysis, addressing the scarcity of specialized datasets.
Contribution
It presents a new synthetic dataset, DDI-100, designed specifically for text detection and OCR tasks, based on real document pages with extensive annotations.
Findings
Several TD and OCR models validated with DDI-100 show high-quality performance on real data
Demonstrates the usefulness of synthetic data for document analysis
Supports a wide range of document analysis tasks
Abstract
Nowadays, document analysis and recognition remain challenging tasks. However, only a few datasets designed for text detection (TD) and optical character recognition (OCR) problems exist. In this paper we present the Distorted Document Images dataset (DDI-100) and demonstrate its usefulness in a wide range of document analysis problems. DDI-100 is a synthetic dataset based on 7000 unique real document pages and consists of more than 100000 augmented images. The ground truth comprises text and stamp masks, as well as text and character bounding boxes with relevant annotations. The dataset was validated using several TD and OCR models, which show high-quality performance on real data.
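To make the ground-truth description concrete, here is a minimal Python sketch of how text boxes, character boxes, and a binary text mask could relate to one another. It is an illustration only: the abstract does not describe the dataset's file layout, so the class names (`TextAnnotation`, `CharAnnotation`), the box convention, and the helper `boxes_to_mask` are all hypothetical and not part of DDI-100. Note also that the dataset ships its masks as ground truth; rasterizing them from boxes here is just a way to show the relationship.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention (not from the paper): (x_min, y_min, x_max, y_max),
# in pixels, with the max coordinates exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class CharAnnotation:
    """Hypothetical record for one character: its box and its label."""
    box: Box
    char: str


@dataclass
class TextAnnotation:
    """Hypothetical record for one text region with per-character boxes."""
    box: Box
    text: str
    chars: List[CharAnnotation] = field(default_factory=list)


def boxes_to_mask(boxes: List[Box], width: int, height: int) -> List[List[int]]:
    """Rasterize boxes into a binary mask (1 inside any box, 0 elsewhere)."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image bounds before filling.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


# A toy 6x4 "page" containing one three-character word.
word = TextAnnotation(
    box=(1, 1, 4, 3),
    text="OCR",
    chars=[
        CharAnnotation(box=(1, 1, 2, 3), char="O"),
        CharAnnotation(box=(2, 1, 3, 3), char="C"),
        CharAnnotation(box=(3, 1, 4, 3), char="R"),
    ],
)
text_mask = boxes_to_mask([word.box], width=6, height=4)
char_mask = boxes_to_mask([c.box for c in word.chars], width=6, height=4)
```

In this toy example the character boxes tile the word box exactly, so the two masks coincide. On real pages character boxes are usually tighter than the enclosing text box, which is one reason to keep both levels of annotation.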
