DDI-100: Dataset for Text Detection and Recognition

Ilia Zharikov, Filipp Nikitin, Ilia Vasiliev, Vladimir Dokholyan (Moscow Institute of Physics and Technology)

arXiv:1912.11658·cs.CV·September 20, 2021

TL;DR

The paper introduces DDI-100, a synthetic dataset of more than 100,000 augmented images built from 7,000 real document pages, intended to address the scarcity of datasets designed for text detection and recognition in document analysis.

Contribution

It presents a new synthetic dataset, DDI-100, designed specifically for text detection and OCR tasks. The dataset is based on real document pages and is annotated with text and stamp masks as well as text and character bounding boxes.

Findings

01. Several TD and OCR models used to validate DDI-100 show high-quality performance on real data.

02. The dataset demonstrates the usefulness of synthetic data for document analysis.

03. The dataset supports a wide range of document analysis tasks.

Abstract

Document analysis and recognition remain challenging tasks, yet only a few datasets designed for text detection (TD) and optical character recognition (OCR) exist. In this paper we present the Distorted Document Images dataset (DDI-100) and demonstrate its usefulness in a wide range of document analysis problems. DDI-100 is a synthetic dataset based on 7,000 unique real document pages and consists of more than 100,000 augmented images. The ground truth comprises text and stamp masks and text and character bounding boxes with relevant annotations. DDI-100 was validated using several TD and OCR models, which show high-quality performance on real data.
