bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents
Imam Mohammad Zulkarnain, Shayekh Bin Islam, Md. Zami Al Zunaed Farabe, Md. Mehedi Hasan Shawon, Jawaril Munshad Abedin, Beig Rajibul Hasan, Marsia Haque, Istiak Shihab, Syed Mobassir, MD. Nazmuddoha Ansary, Asif Sushmit, Farig Sadeque

TL;DR
bbOCR is an open-source OCR system tailored for Bengali documents, utilizing novel models and synthetic datasets to improve digitization in low-resource language contexts.
Contribution
Introduces a scalable Bengali OCR system with new recognition models and synthetic datasets, enhancing document digitization for low-resource languages.
Findings
Outperforms existing Bengali OCR systems in evaluations.
Uses novel synthetic datasets for training and evaluation.
Provides open-source code and datasets for community use.
Abstract
Despite the existence of numerous Optical Character Recognition (OCR) tools, the lack of comprehensive open-source systems hampers the progress of document digitization in various low-resource languages, including Bengali. Low-resource languages, especially those with an alphasyllabary writing system, suffer from the lack of large-scale datasets for various document OCR components such as word-level OCR, document layout extraction, and distortion correction, which are available as individual modules in high-resource languages. In this paper, we introduce BengaliAI-BRACU-OCR (bbOCR): an open-source scalable document OCR system that can reconstruct Bengali documents into a structured searchable digitized format that leverages a novel Bengali text recognition model and two novel synthetic datasets. We present extensive component-level and system-level evaluation: both use a novel…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Vehicle License Plate Recognition · Image Retrieval and Classification Techniques
