BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset

Md. Istiak Hossain Shihab; Md. Rakibul Hasan; Mahfuzur Rahman Emon; Syed Mobassir Hossen; Md. Nazmuddoha Ansary; Intesur Ahmed; Fazle Rabbi Rakib; Shahriar Elahi Dhruvo; Souhardya Saha Dip; Akib Hasan Pavel; Marsia Haque Meghla; Md. Rezwanul Haque; Sayma Sultana Chowdhury; Farig Sadeque; Tahsin Reasat; Ahmed Imtiaz Humayun; Asif Shahriyar Sushmit

arXiv:2303.05325 · cs.CV · May 8, 2023 · 1 citation

TL;DR

BaDLAD is the first large, multi-domain Bengali document layout analysis dataset, enabling improved deep learning models for Bengali OCR and document transcription, especially for historical and domain-specific documents.

Contribution

This paper introduces BaDLAD, the first large-scale multi-domain Bengali DLA dataset with extensive annotations, facilitating research in Bengali document digitization.

Findings

01 Existing deep learning models perform well on BaDLAD benchmarks.

02 BaDLAD enables effective training of Bengali OCR models.

03 The dataset covers diverse document types and domains.

Abstract

While strides have been made in deep learning-based Bengali Optical Character Recognition (OCR) in the past decade, the absence of large Document Layout Analysis (DLA) datasets has hindered the application of OCR in document transcription, e.g., transcribing historical documents and newspapers. Moreover, the rule-based DLA systems currently employed in practice are not robust to domain variations and out-of-distribution layouts. To this end, we present the first large multi-domain Bengali Document Layout Analysis Dataset: BaDLAD. This dataset contains 33,695 human-annotated document samples from six domains: i) books and magazines, ii) public domain government documents, iii) liberation war documents, iv) newspapers, v) historical newspapers, and vi) property deeds, with 710K polygon annotations for four unit types: text-box, paragraph, image, and table. Through preliminary…
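To make the annotation scheme in the abstract concrete, the following is a minimal Python sketch of how polygon annotations for the four unit types might be represented and summarized. The record layout (`unit_type` and `polygon` keys), the `summarize` helper, and the sample page are hypothetical illustrations, not the dataset's actual file format. The only figures taken from the abstract are the 33,695 samples and the rounded count of 710K annotations, so the per-sample average is approximate.

```python
from collections import Counter

# The four unit types named in the abstract.
UNIT_TYPES = ("text-box", "paragraph", "image", "table")


def polygon_area(points):
    """Area of a simple polygon [(x, y), ...] via the shoelace formula."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def summarize(annotations):
    """Count annotations and total polygon area per unit type."""
    counts = Counter()
    areas = Counter()
    for ann in annotations:
        if ann["unit_type"] not in UNIT_TYPES:
            raise ValueError(f"unknown unit type: {ann['unit_type']}")
        counts[ann["unit_type"]] += 1
        areas[ann["unit_type"]] += polygon_area(ann["polygon"])
    return counts, areas


# Hypothetical annotations for one page (assumed layout, made-up coordinates).
sample_page = [
    {"unit_type": "paragraph", "polygon": [(0, 0), (100, 0), (100, 40), (0, 40)]},
    {"unit_type": "text-box", "polygon": [(0, 0), (100, 0), (100, 10), (0, 10)]},
    {"unit_type": "image", "polygon": [(0, 50), (60, 50), (30, 110)]},
]

counts, areas = summarize(sample_page)

# Rough average annotation density implied by the abstract's rounded figures.
avg_annotations_per_sample = 710_000 / 33_695

if __name__ == "__main__":
    print(dict(counts))
    print(dict(areas))
    print(round(avg_annotations_per_sample, 1))
```

By this estimate, the abstract's figures imply roughly 21 polygon annotations per document sample on average.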

Code & Models

Repositories

anon-user-for-web/badlad (Official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Digital Media Forensic Detection · Image Processing and 3D Reconstruction

Methods: Deep Layer Aggregation