Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese

Marek Rychlik, Dwight Nwaigwe, Yan Han, Dylan Murphy

arXiv:2005.08650·cs.CV·May 19, 2020·1 cites

Open Access

TL;DR

This paper presents Worldly OCR, a deep learning-based system for accurate image-to-text conversion across multiple languages and scripts, including cursive and non-cursive writing systems, targeting large-scale digital document processing.

Contribution

It introduces a novel OCR approach tailored for diverse scripts like Pashto, Farsi, and Traditional Chinese, handling large character sets and cursive writing.

Findings

1. Achieved high accuracy on the cursive scripts used for Pashto and Farsi.

2. Developed methods for Traditional Chinese, whose character set runs to 65,000 characters.

3. Designed the system to scale to more than a billion pages.

Abstract

We report upon the results of a research and prototype building project Worldly OCR dedicated to developing new, more accurate image-to-text conversion software for several languages and writing systems. These include the cursive scripts Farsi and Pashto, and Latin cursive scripts. We also describe approaches geared towards Traditional Chinese, which is non-cursive, but features an extremely large character set of 65,000 characters. Our methodology is based on Machine Learning, especially Deep Learning, and Data Science, and is directed towards vast quantities of original documents, exceeding a billion pages. The target audience of this paper is a general audience with interest in Digital Humanities or in retrieval of accurate full-text and metadata from digital images.

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Mathematics, Computing, and Information Processing