TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
Amanpreet Singh, Guan Pang, Mandy Toh, Jing Huang, Wojciech Galuba, Tal Hassner

TL;DR
This paper introduces TextOCR, a large-scale dataset for scene text detection and recognition in real images, and demonstrates how training on it improves OCR and reasoning performance in TextVQA tasks.
Contribution
The paper presents TextOCR, a new annotated dataset for arbitrary-shaped scene text, and shows how training on it enhances OCR accuracy and reasoning in TextVQA.
Findings
Current state-of-the-art text-recognition (OCR) models perform poorly on TextOCR.
Training on TextOCR improves OCR performance on multiple datasets.
Using TextOCR-trained OCR models enhances scene text reasoning in TextVQA.
Abstract
A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps datasets involves detecting and recognizing text present in the images using an optical character recognition (OCR) system. Current systems are crippled by the unavailability of ground-truth text annotations for these datasets, as well as by the lack of scene text detection and recognition datasets on real images, which hinders progress in the field of OCR and the evaluation of scene-text-based reasoning in isolation from OCR systems. In this work, we propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR and that training on TextOCR helps achieve state-of-the-art performance on multiple other OCR datasets as well. We…
