DoPTA: Improving Document Layout Analysis using Patch-Text Alignment

Nikitha SR; Tarun Ram Menta; Mausoom Sarkar

arXiv:2412.12902·cs.CV·March 11, 2025

Open Access

TL;DR

This paper introduces DoPTA, a novel image-text alignment method for document understanding that improves performance without OCR during inference, outperforming larger models on key benchmarks.

Contribution

The paper presents a new image-text alignment technique for document analysis that enhances visual task performance without relying on OCR at inference time.

Findings

01. Sets a new state of the art on the D4LA and FUNSD benchmarks.

02. Outperforms larger models while using less pre-training compute.

03. Is effective across a wide range of document understanding tasks.

Abstract

The advent of multimodal learning has brought a significant improvement in document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, works in this space often focus on the textual aspect, using the visual space as auxiliary information. While some works have explored pure vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text in their learning procedure. Therefore, we present a novel image-text alignment technique specially designed for leveraging the textual information in document images to improve performance on visual tasks. Our document encoder model, DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks, without requiring OCR during…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction

Methods: ALIGN