# Brno Mobile OCR Dataset

**Authors:** Martin Kišš, Michal Hradiš, Oldřich Kodym

arXiv: 1907.01307 · 2019-07-03

## TL;DR

The Brno Mobile OCR Dataset (B-MOD) provides a challenging collection of low-quality mobile-captured document images with detailed annotations, enabling robust development and evaluation of OCR methods for real-world mobile scenarios.

## Contribution

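This paper introduces B-MOD, a dataset built specifically for document OCR on low-quality photographs from handheld mobile devices, a setting the authors say no existing dataset covers. It ships with precise line positions and transcriptions for about 500k text lines, plus an evaluation methodology that includes an evaluation server and a test set with non-public annotations.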

## Key findings

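- The baseline, a convolutional and recurrent network trained with Connectionist Temporal Classification (CTC) loss, achieves word error rates (WER) of 2%, 22%, and 73% on the easy, medium, and hard subsets, respectively.
- The dataset contains 19,728 photographs of 2,113 unique pages, captured with 23 different mobile devices and annotated with about 500k text lines.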
- The dataset is challenging and suitable for advancing OCR robustness in mobile environments.
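The word error rates quoted above can be made concrete with a small sketch. It uses the standard definition of WER (word-level edit distance divided by the number of reference words) and assumes whitespace tokenization; the summary does not state the paper's exact normalization or tokenization, so the helper names `edit_distance` and `word_error_rate` and those choices are illustrative, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER over paired text lines (assumed whitespace tokenization)."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        hyp_words = hyp.split()
        total_errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words
```

For example, scoring the hypothesis `"the quack brown"` against the reference `"the quick brown fox"` counts one substitution and one deletion over four reference words.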

## Abstract

We introduce the Brno Mobile OCR Dataset (B-MOD) for document Optical Character Recognition from low-quality images captured by handheld mobile devices. While OCR of high-quality scanned documents is a mature field where many commercial tools are available, and large datasets of text in the wild exist, no existing datasets can be used to develop and test document OCR methods robust to non-uniform lighting, image blur, strong noise, built-in denoising, sharpening, compression and other artifacts present in many photographs from mobile devices.

This dataset contains 2 113 unique pages from random scientific papers, which were photographed by multiple people using 23 different mobile devices. The resulting 19 728 photographs of various visual quality are accompanied by precise positions and text annotations of 500k text lines. We further provide an evaluation methodology, including an evaluation server and a test set with non-public annotations.

We provide a state-of-the-art text recognition baseline built on convolutional and recurrent neural networks trained with Connectionist Temporal Classification loss. This baseline achieves 2 %, 22 % and 73 % word error rates on easy, medium and hard parts of the dataset, respectively, confirming that the dataset is challenging.

The presented dataset will enable future development and evaluation of document analysis for low-quality images. It is primarily intended for line-level text recognition, and can be further used for line localization, layout analysis, image restoration and text binarization.
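A network trained with CTC loss emits one score vector per horizontal frame of the text line, and those frames must be collapsed into a string. The abstract does not say how the baseline decodes its outputs; the sketch below shows greedy (best-path) decoding, a common choice, purely as an illustration. The function name `ctc_greedy_decode`, the `labels` list, and the convention that class 0 is the CTC blank are all assumptions.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks.

    frame_scores: list of per-frame score lists, one score per class.
    labels: labels[i] is the character for class i (labels[blank] is unused).
    """
    # Pick the highest-scoring class in every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    decoded = []
    previous = None
    for index in best_path:
        # Emit a character only when the class changes and is not the blank.
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)
```

A blank between two identical classes is what lets CTC produce doubled letters: the per-frame path `a a <blank> a b b` decodes to `aab`, while `a a a b b` decodes to `ab`.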

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.01307/full.md

## Figures

34 figures with captions in the complete paper: https://tomesphere.com/paper/1907.01307/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/1907.01307/full.md

---
Source: https://tomesphere.com/paper/1907.01307