OCR-free Document Understanding Transformer

Geewook Kim; Teakgyu Hong; Moonbin Yim; Jeongyeon Nam; Jinyoung Park; Jinyeong Yim; Wonseok Hwang; Sangdoo Yun; Dongyoon Han; Seunghyun Park

arXiv:2111.15664·cs.LG·October 7, 2022·5 cites

TL;DR

Donut is an OCR-free transformer-based model for document understanding that achieves state-of-the-art results, reducing computational costs and error propagation associated with traditional OCR-dependent methods.

Contribution

This paper introduces Donut, the first simple OCR-free transformer model for visual document understanding, with a novel pre-training approach and synthetic data generator for multilingual and multi-domain flexibility.

Findings

1. Achieves state-of-the-art performance on various VDU tasks.
2. Reduces computational costs compared to OCR-based methods.
3. Demonstrates robustness across languages and document types.

Abstract

Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training…
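
The third issue listed in the abstract, OCR error propagation, can be illustrated with a toy two-stage pipeline. The sketch below is not Donut and does not call any real OCR engine: the OCR output strings, the regular expression, and the function name `extract_total` are all assumptions invented for illustration. It only shows how a single misread character in the reading stage can make the downstream understanding stage fail.

```python
import re
from typing import Optional


def extract_total(ocr_text: str) -> Optional[float]:
    """Hypothetical downstream step: pull the total amount out of OCR text."""
    match = re.search(r"TOTAL\s+(\d+\.\d{2})", ocr_text)
    return float(match.group(1)) if match else None


# Assumed OCR outputs for the same invoice image (made up for this example).
clean_ocr = "INVOICE 0042\nTOTAL 105.00"
# Assumed OCR mistake: the digit '0' is read as the letter 'O'.
noisy_ocr = "INVOICE 0042\nTOTAL 1O5.00"

print(extract_total(clean_ocr))  # 105.0
print(extract_total(noisy_ocr))  # None: the misread character breaks extraction
```

The extraction logic is unchanged between the two calls; only the upstream text differs, which is the dependency an OCR-free model is meant to avoid.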

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings