DocVLM: Make Your VLM an Efficient Reader

Mor Shpigel Nacson; Aviad Aberdam; Roy Ganz; Elad Ben Avraham; Alona Golts; Yair Kittenplon; Shai Mazor; Ron Litman

arXiv:2412.08746 · cs.CV · December 13, 2024

Open Access 8 Models

TL;DR

DocVLM adds an OCR-based modality to vision-language models, improving both the efficiency and the accuracy of document understanding. It reduces dependence on high-resolution images and enables multi-page document processing.

Contribution

The paper presents an OCR-augmented approach that enhances VLMs for document understanding, reducing computational cost and improving performance on reading-intensive tasks.

Findings

01 Substantial accuracy improvements in DocVQA tasks.

02 Reduced reliance on high-resolution images.

03 Effective multi-page document processing.

Abstract

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive applications demand high-resolution inputs, resulting in significant computational overhead. Using OCR-extracted text in VLM prompts partially addresses this issue but underperforms compared to its full-resolution counterpart, as it lacks the complete visual context needed for optimal performance. We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving the original weights. Our approach employs an OCR encoder to capture textual content and layout, compressing these into a compact set of learned queries incorporated into the VLM. Comprehensive evaluations across leading VLMs show that DocVLM…
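The abstract describes compressing OCR encoder outputs into a compact set of learned queries, but does not spell out the mechanism. The sketch below illustrates one generic way such compression can work: each query attends over all OCR token embeddings and returns an attention-weighted average. Cross-attention pooling is an assumption made here for illustration, not the paper's confirmed design; the function names, the sizes, and the random stand-ins for trained queries are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_with_queries(ocr_embeddings, queries):
    """Pool N OCR token embeddings into len(queries) vectors.

    Assumed mechanism (not taken from the paper): each query scores every
    OCR embedding with a scaled dot product, and the output for that query
    is the softmax-weighted average of the OCR embeddings.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in ocr_embeddings]
        weights = softmax(scores)
        pooled = [
            sum(w * k[d] for w, k in zip(weights, ocr_embeddings))
            for d in range(dim)
        ]
        compressed.append(pooled)
    return compressed


random.seed(0)
DIM, NUM_OCR_TOKENS, NUM_QUERIES = 8, 200, 4  # hypothetical sizes

# Random stand-ins: a real system would use OCR encoder outputs and trained queries.
ocr_embeddings = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_OCR_TOKENS)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

compressed = compress_with_queries(ocr_embeddings, queries)
print(len(ocr_embeddings), "->", len(compressed), "vectors of size", len(compressed[0]))
```

The point of the sketch is the shape change: however many OCR tokens a page yields, the VLM would receive only a fixed, small number of vectors, which is what keeps the added sequence length bounded.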

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling

Methods: Sparse Evolutionary Training