DocBed: A Multi-Stage OCR Solution for Documents with Complex Layouts

Wenzhen Zhu; Negin Sokhandan; Guang Yang; Sujitha Martin; Suchitra Sathyanarayana

arXiv:2202.01414 · cs.CV · February 4, 2022 · 1 citation

TL;DR

This paper introduces DocBed, a multi-stage OCR system tailored for complex newspaper layouts, featuring a new dataset, layout segmentation as a precursor to OCR, and a comprehensive evaluation protocol.

Contribution

It provides a large annotated newspaper dataset, proposes layout segmentation as a key step before OCR, and establishes evaluation standards for complex document digitization.

Findings

01. New dataset of 3000 annotated newspaper images from 21 U.S. states.

02. Layout segmentation improves OCR accuracy on complex layouts.

03. Thorough evaluation protocol for layout segmentation and OCR performance.

Abstract

Digitization of newspapers is of interest for many reasons, including preservation of history, accessibility, and searchability. While digitization of documents such as scientific articles and magazines is prevalent in the literature, one of the main challenges in digitizing newspapers lies in analyzing their complex layouts (e.g., articles spanning multiple columns, text interrupted by images), which is necessary to preserve human reading order. This work provides a major breakthrough in the digitization of newspapers on three fronts: first, releasing a dataset of 3000 fully-annotated, real-world newspaper images from 21 different U.S. states representing an extensive variety of complex layouts for document layout analysis; second, proposing layout segmentation as a precursor to existing optical character recognition (OCR) engines, where multiple state-of-the-art image segmentation models…
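
The idea of running layout segmentation before OCR can be illustrated with a minimal sketch. It is not the authors' implementation: the `Region` schema, the column-grouping heuristic in `reading_order`, the `column_tolerance` parameter, and the `ocr_fn` callable (a stand-in for whatever OCR engine is used) are all assumptions made for illustration. The page image is modeled as a plain list of rows so the example has no external dependencies.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Region:
    """One segmented layout block in pixel coordinates (hypothetical schema)."""
    label: str  # e.g. "headline", "article", "image"
    x0: int
    y0: int
    x1: int
    y1: int


def reading_order(regions: List[Region], column_tolerance: int = 50) -> List[Region]:
    """Order regions column by column (left to right), top to bottom within a column.

    Regions whose left edge lies within `column_tolerance` pixels of the first
    region in the current column are grouped into that column. This is a simple
    heuristic chosen for illustration only.
    """
    columns: List[List[Region]] = []
    for region in sorted(regions, key=lambda r: r.x0):
        if columns and abs(region.x0 - columns[-1][0].x0) <= column_tolerance:
            columns[-1].append(region)
        else:
            columns.append([region])
    ordered: List[Region] = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda r: r.y0))
    return ordered


def ocr_page(
    image: Sequence[Sequence],
    regions: List[Region],
    ocr_fn: Callable[[list], str],
    text_labels: Sequence[str] = ("headline", "article"),
    column_tolerance: int = 50,
) -> str:
    """Crop each text region, run the supplied OCR callable, and join in reading order."""
    texts = []
    for region in reading_order(regions, column_tolerance):
        if region.label not in text_labels:
            continue  # skip non-text blocks such as images
        crop = [row[region.x0:region.x1] for row in image[region.y0:region.y1]]
        texts.append(ocr_fn(crop))
    return "\n\n".join(texts)
```

In practice the regions would come from a segmentation model and `ocr_fn` would wrap an existing OCR engine; keeping both as inputs means either stage can be swapped without touching the ordering logic.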

Taxonomy

Topics: Handwritten Text Recognition Techniques · Vehicle License Plate Recognition · Image Processing and 3D Reconstruction