DUBLIN -- Document Understanding By Language-Image Network

Kriti Aggarwal; Aditi Khandelwal; Kumar Tanmay; Owais Mohammed Khan; Qiang Liu; Monojit Choudhury; Hardik Hansrajbhai Chauhan; Subhojit Som; Vishrav Chaudhary; Saurabh Tiwary

arXiv:2305.14218 · cs.CV · October 30, 2023 · 1 citation

Open Access

TL;DR

DUBLIN is a novel pretraining approach for visual document understanding that leverages spatial and semantic information through three objectives, achieving state-of-the-art results across multiple benchmarks and datasets.

Contribution

The paper introduces DUBLIN, a pixel-based model pretrained with three innovative objectives on web pages, enhancing generalization and performance in document understanding tasks.

Findings

01. Achieves 77.75 EM (exact match) and 84.25 F1 on the WebSRC dataset.

02. Outperforms the prior state of the art on the DocVQA, InfographicsVQA, OCR-VQA, and AI2D datasets.

03. Demonstrates competitive results on RVL-CDIP document classification.
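
The EM and F1 figures reported for WebSRC are extractive question-answering metrics. As a rough illustration of what such scores measure, the sketch below computes exact match and token-level F1 for a single prediction, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The function names and the normalization rules are illustrative assumptions, not the paper's or the benchmark's official evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the official scorer may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score such as 77.75 EM would then be the mean of the per-example values, expressed as a percentage.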

Abstract

Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their generalization ability across different document types and languages. In this paper, we propose DUBLIN, which is pretrained on web pages using three novel objectives: Masked Document Text Generation Task, Bounding Box Task, and Rendered Question Answering Task, that leverage both the spatial and semantic information in the document images. Our model achieves competitive or state-of-the-art results on several benchmarks, such as Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. In particular, we show that DUBLIN is the first pixel-based model to…

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Natural Language Processing Techniques