DTrOCR: Decoder-only Transformer for Optical Character Recognition

Masato Fujitake

arXiv:2308.15996·cs.CV·August 31, 2023·2 cites

TL;DR

DTrOCR introduces a decoder-only Transformer architecture for optical character recognition, leveraging pre-trained generative language models to improve recognition accuracy across various text types and languages.

Contribution

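This paper presents a decoder-only Transformer approach to OCR that drops the separate image encoder and builds on a generative language model pre-trained on a large corpus, outperforming prior state-of-the-art methods on printed, handwritten, and scene text in English and Chinese.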

Findings

01. Outperforms state-of-the-art OCR methods in multiple text recognition tasks.

02. Effective for printed, handwritten, and scene text in English and Chinese.

03. Demonstrates the versatility of generative language models in computer vision.

Abstract

Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image, and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generative language model that has been successful in natural language processing can also be effective for text recognition in computer vision. Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
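
The abstract does not spell out how an image reaches a decoder-only model, so the following is only a minimal sketch of one plausible layout, not the paper's implementation. It assumes the image is cut into non-overlapping patches, the patches are placed in front of a separator token, and the transcription follows as ordinary next-token prediction under a causal mask. The names `image_to_patches`, `build_layout`, `causal_mask`, the `PATCH` sentinel, and the token ids are all hypothetical and chosen for illustration.

```python
from typing import List, Optional, Tuple

PATCH = -1  # hypothetical sentinel marking a slot filled by an image patch


def image_to_patches(image: List[List[float]], patch_h: int, patch_w: int) -> List[List[float]]:
    """Split a 2-D grayscale image into flattened, non-overlapping patches (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patches.append([image[r][c]
                            for r in range(top, top + patch_h)
                            for c in range(left, left + patch_w)])
    return patches


def build_layout(num_patches: int, text_ids: List[int],
                 sep_id: int, eos_id: int) -> Tuple[List[int], List[Optional[int]]]:
    """Lay out one training example as a single sequence.

    inputs:  [patch slots ..., SEP, text tokens ...]
    targets: the token to predict at each position, or None where no loss applies
             (patch slots carry no text target in this sketch).
    """
    inputs = [PATCH] * num_patches + [sep_id] + list(text_ids)
    targets: List[Optional[int]] = [None] * num_patches + list(text_ids) + [eos_id]
    return inputs, targets


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


# Usage: a 4x8 image cut into two 4x4 patches, transcribed as token ids [7, 8].
image = [[float(r * 8 + c) for c in range(8)] for r in range(4)]
patches = image_to_patches(image, patch_h=4, patch_w=4)
inputs, targets = build_layout(len(patches), [7, 8], sep_id=1, eos_id=2)
mask = causal_mask(len(inputs))
print(inputs)   # [-1, -1, 1, 7, 8]
print(targets)  # [None, None, 7, 8, 2]
```

Under this layout there is no separate encoder: the same stack of decoder blocks reads the patch slots and then generates text, and the separator position is where the first character is predicted. In a real model the patch slots would be filled by a learned projection of each flattened patch rather than by a sentinel id.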

Code & Models

Repositories

arvindrajan92/DTrOCR (PyTorch)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Image Retrieval and Classification Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dropout · Adam · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections · Residual Connection