DoPTA: Improving Document Layout Analysis using Patch-Text Alignment
Nikitha SR, Tarun Ram Menta, Mausoom Sarkar

TL;DR
This paper introduces DoPTA, a document encoder trained with a novel image-text alignment technique. It improves performance on document understanding tasks without requiring OCR at inference time and outperforms larger models on key benchmarks.
Contribution
The paper presents a new image-text alignment technique for document analysis that enhances visual task performance without relying on OCR at inference time.
Findings
Sets a new state of the art on the D4LA and FUNSD benchmarks.
Outperforms larger models with less pre-training compute.
Effective for a wide range of document understanding tasks.
Abstract
The advent of multimodal learning has brought a significant improvement in document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, works in this space often focus on the textual aspect, using the visual space as auxiliary information. While some works have explored pure vision-based techniques for document image understanding, they require OCR-identified text as input during inference, or do not align with text in their learning procedure. Therefore, we present a novel image-text alignment technique specially designed for leveraging the textual information in document images to improve performance on visual tasks. Our document encoder model, DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks, without requiring OCR during…
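The abstract does not spell out the alignment objective, so the following is only a minimal sketch of what a generic patch-text alignment loss could look like, not the authors' method. It assumes that each image patch embedding is paired with the embedding of the text located in that patch, and it uses a symmetric contrastive (InfoNCE-style) loss, which is a swapped-in, commonly used technique. The function names, the pairing scheme, and the `temperature` value are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(patch_embs, text_embs, temperature):
    """Temperature-scaled cosine similarity between every patch and every text."""
    patches = [l2_normalize(v) for v in patch_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(p, t)) / temperature for t in texts]
        for p in patches
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def patch_text_alignment_loss(patch_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over paired patch and text embeddings.

    Assumption (not from the paper): patch_embs[i] and text_embs[i] describe
    the same document region; all other pairs act as negatives.
    """
    logits = cosine_logits(patch_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    patch_to_text = cross_entropy_diagonal(logits)
    text_to_patch = cross_entropy_diagonal(transposed)
    return 0.5 * (patch_to_text + text_to_patch)


# Toy usage: correctly paired embeddings should score a lower loss than swapped ones.
patches = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = patch_text_alignment_loss(patches, aligned_texts)
swapped_loss = patch_text_alignment_loss(patches, swapped_texts)
```

In a setup like this, text embeddings would be needed only to compute the training loss; at inference the image encoder would run on its own, which is consistent with the abstract's claim that no OCR is required at inference time.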
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction
Methods: ALIGN
