VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
Ofir Abramovich, Niv Nayman, Sharon Fogel, Inbal Lavi, Ron Litman, Shahar Tsiper, Royee Tichauer, Srikar Appalaraju, Shai Mazor, R. Manmatha

TL;DR
VisFocus introduces a prompt-guided vision encoder for OCR-free dense document understanding, enabling better focus on relevant document parts and achieving state-of-the-art results.
Contribution
The paper proposes a novel prompt-guided vision encoder architecture with a new pre-training task, enhancing OCR-free document understanding capabilities.
Findings
Significant performance improvements on multiple benchmarks.
Effective highlighting of relevant document regions.
State-of-the-art results achieved.
Abstract
In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component can either be extracted explicitly with the use of external OCR models in OCR-based approaches, or alternatively, the vision model can be endowed with reading capabilities in OCR-free approaches. Typically, the queries to the model are input exclusively to the language component, necessitating the visual features to encompass the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt and allow highlighting relevant parts of the document, while disregarding others. We pair the…
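The abstract describes replacing the vision encoder's down-sampling layers with layers that also receive the prompt. The excerpt does not give the layer design, so the following is only a minimal NumPy sketch under stated assumptions: visual tokens cross-attend to prompt tokens (queries from the image, keys and values from the prompt), the result is added back residually, and a Swin-style 2x2 patch merge then halves the resolution. The function names, the random stand-in weights, and the choice of cross-attention plus patch merging are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(C, rng):
    # Hypothetical random weights standing in for learned parameters.
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    Wr = rng.standard_normal((4 * C, 2 * C)) / np.sqrt(4 * C)
    return {"Wq": Wq, "Wk": Wk, "Wv": Wv, "Wr": Wr}

def prompt_guided_downsample(feats, prompt, params):
    """Assumed layer: cross-attention to the prompt, then 2x2 patch merging.

    feats:  (H, W, C) visual feature map, H and W even
    prompt: (T, C) prompt token embeddings
    returns (H // 2, W // 2, 2 * C) features and (H * W, T) attention weights
    """
    H, W, C = feats.shape
    assert H % 2 == 0 and W % 2 == 0, "sketch assumes even spatial dimensions"

    x = feats.reshape(H * W, C)
    q = x @ params["Wq"]          # queries come from visual tokens
    k = prompt @ params["Wk"]     # keys come from prompt tokens
    v = prompt @ params["Wv"]     # values come from prompt tokens

    # Each visual token distributes attention over the prompt tokens.
    attn = softmax(q @ k.T / np.sqrt(C), axis=-1)
    x = (x + attn @ v).reshape(H, W, C)  # residual update with prompt context

    # Concatenate each 2x2 neighborhood along channels, then project 4C -> 2C.
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )
    return merged @ params["Wr"], attn

rng = np.random.default_rng(0)
params = init_params(8, rng)
feats = rng.standard_normal((4, 6, 8))   # toy 4x6 feature map with 8 channels
prompt = rng.standard_normal((5, 8))     # toy prompt with 5 tokens
out, attn = prompt_guided_downsample(feats, prompt, params)
print(out.shape, attn.shape)
```

In this sketch the prompt influences the features before resolution is reduced, which is one plausible way for a down-sampling stage to emphasize prompt-relevant regions; the actual mechanism should be taken from the paper itself.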
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Mathematics, Computing, and Information Processing
Methods: Softmax · Attention Is All You Need
