DocVLM: Make Your VLM an Efficient Reader
Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman

TL;DR
DocVLM adds an OCR-based modality to vision-language models, improving the efficiency and accuracy of document understanding, reducing dependence on high-resolution images, and enabling multi-page processing.
Contribution
The paper presents an OCR-augmented approach that enhances VLMs for document understanding, lowering computational cost and improving performance on reading-intensive tasks.
Findings
Substantial accuracy improvements in DocVQA tasks.
Reduced reliance on high-resolution images.
Effective multi-page document processing.
Abstract
Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive applications demand high-resolution inputs, resulting in significant computational overhead. Using OCR-extracted text in VLM prompts partially addresses this issue but underperforms compared to its full-resolution counterpart, as it lacks the complete visual context needed for optimal performance. We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving original weights. Our approach employs an OCR encoder to capture textual content and layout, compressing these into a compact set of learned queries incorporated into the VLM. Comprehensive evaluations across leading VLMs show that DocVLM…
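The compression step described in the abstract can be illustrated with a small sketch. It assumes the compression behaves like cross-attention pooling, where a fixed number of query vectors each attend over all OCR token embeddings; the abstract does not specify the mechanism, so this is a stand-in and not the authors' implementation. The names `compress_ocr_tokens` and `build_vlm_input` are hypothetical, the embeddings are random placeholders for OCR encoder outputs (which would carry both text and layout), and the queries are random here where a real system would learn them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compress_ocr_tokens(ocr_embeddings, queries):
    """Assumed mechanism: each query attends over all OCR token embeddings.

    Returns one pooled vector per query, so the output length equals
    len(queries) no matter how many OCR tokens the document has.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for query in queries:
        weights = softmax([dot(query, tok) * scale for tok in ocr_embeddings])
        pooled = [
            sum(w * tok[d] for w, tok in zip(weights, ocr_embeddings))
            for d in range(dim)
        ]
        compressed.append(pooled)
    return compressed


def build_vlm_input(visual_tokens, compressed_ocr, text_tokens):
    """Hypothetical ordering: visual tokens, then OCR queries, then the prompt."""
    return visual_tokens + compressed_ocr + text_tokens


# Illustrative sizes only; none of these numbers come from the paper.
DIM, NUM_OCR_TOKENS, NUM_QUERIES = 8, 200, 16
rng = random.Random(0)
ocr_embeddings = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_OCR_TOKENS)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

compressed = compress_ocr_tokens(ocr_embeddings, queries)
print(len(ocr_embeddings), "->", len(compressed))  # 200 -> 16
```

The point of the sketch is the shape change: the number of vectors handed to the VLM is set by the number of queries, not by the amount of text on the page, which is what lets the OCR modality stay compact.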
Code & Models
- 🤗 prithivMLmods/coreOCR-7B-050325-preview · model · 218 dl · ♡ 13
- 🤗 prithivMLmods/docscopeOCR-7B-050425-exp · model · 175 dl · ♡ 7
- 🤗 prithivMLmods/visionOCR-3B-061125 · model · 532 dl · ♡ 5
- 🤗 prithivMLmods/DREX-062225-exp · model · 13 dl · ♡ 6
- 🤗 prithivMLmods/Camel-Doc-OCR-062825 · model · 55 dl · ♡ 13
- 🤗 prithivMLmods/Camel-Doc-OCR-080125 · model · 231 dl · ♡ 8
- 🤗 prithivMLmods/proxima-ocr-d.markdown-post3.0.l · model · 102 dl · ♡ 5
- 🤗 prithivMLmods/epsilon-ocr-d.markdown-post3.0.m · model · 8 dl · ♡ 3
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
Methods: Sparse Evolutionary Training
