OCR-free Document Understanding Transformer
Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park

TL;DR
Donut is an OCR-free transformer-based model for document understanding that achieves state-of-the-art results, reducing computational costs and error propagation associated with traditional OCR-dependent methods.
Contribution
This paper introduces Donut, the first simple OCR-free transformer model for visual document understanding, with a novel pre-training approach and synthetic data generator for multilingual and multi-domain flexibility.
Findings
- Achieves state-of-the-art performance on various VDU tasks.
- Reduces computational costs compared to OCR-based methods.
- Demonstrates robustness across languages and document types.
Abstract
Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training…
Code & Models
- 🤗 naver-clova-ix/donut-base-finetuned-docvqa · model · 17k dl · ♡ 273
- 🤗 naver-clova-ix/donut-base-finetuned-cord-v2 · model · 20k dl · ♡ 119
- 🤗 naver-clova-ix/donut-base · model · 228k dl · ♡ 252
- 🤗 naver-clova-ix/donut-proto · model · 11 dl · ♡ 7
- 🤗 naver-clova-ix/donut-base-finetuned-rvlcdip · model · 2.5k dl · ♡ 20
- 🤗 naver-clova-ix/donut-base-finetuned-cord-v1-2560 · model · 10 dl · ♡ 1
- 🤗 naver-clova-ix/donut-base-finetuned-cord-v1 · model · 55 dl
- 🤗 naver-clova-ix/donut-base-finetuned-zhtrainticket · model · 99 dl
- 🤗 philschmid/donut-base-finetuned-cord-v2 · model · 17 dl · ♡ 6
- 🤗 jinhybr/OCR-Donut-CORD · model · 60 dl · ♡ 206
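
As a rough illustration of how one of the checkpoints listed above might be used, the following is a minimal sketch that runs document question answering directly on an image, with no separate OCR step. It assumes the Hugging Face `transformers` library (its `DonutProcessor` and `VisionEncoderDecoderModel` classes) and PyTorch are installed. The prompt layout in `build_docvqa_prompt` follows the convention described on the model card and is an assumption here, not something stated in the abstract; the helper names `build_docvqa_prompt` and `answer_question` are hypothetical.

```python
from typing import Any

# Checkpoint name taken from the model list above.
CHECKPOINT = "naver-clova-ix/donut-base-finetuned-docvqa"


def build_docvqa_prompt(question: str) -> str:
    """Build the decoder prompt (tag layout assumed from the model card)."""
    return f"<s_docvqa><s_question>{question}</s_question><s_answer>"


def answer_question(image: Any, question: str, checkpoint: str = CHECKPOINT) -> dict:
    """Answer a question about an RGB PIL image of a document (sketch only)."""
    # Heavy imports stay local so the prompt helper works without them.
    import re

    import torch
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    model.eval()

    # The image goes straight to the vision encoder; no OCR output is involved.
    pixel_values = processor(image, return_tensors="pt").pixel_values
    decoder_input_ids = processor.tokenizer(
        build_docvqa_prompt(question),
        add_special_tokens=False,
        return_tensors="pt",
    ).input_ids

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
        )

    # Strip special tokens and the leading task tag, then parse the tag
    # sequence into a dictionary.
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "")
    sequence = sequence.replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
    return processor.token2json(sequence)
```

A call might look like `answer_question(Image.open("invoice.png").convert("RGB"), "What is the total?")`, where `invoice.png` is a placeholder file name. Loading the processor and model inside the function keeps the sketch short; for repeated queries they would normally be loaded once and reused.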
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
