TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models
Jonathan Fhima, Elad Ben Avraham, Oren Nuriel, Yair Kittenplon, Roy Ganz, Aviad Aberdam, Ron Litman

TL;DR
TAP-VL introduces a novel approach that treats OCR-derived text as a separate modality, enhancing vision-language models' ability to understand text within images through a lightweight transformer-based integration.
Contribution
The paper proposes TAP-VL, a new method that integrates OCR information as a distinct modality into VL models, improving text understanding in images.
Findings
Consistent performance improvements across multiple VL benchmarks.
Effective integration of OCR as a separate modality enhances text comprehension.
Pretraining of OCR module on unlabeled data benefits downstream VL tasks.
Abstract
Vision-Language (VL) models have garnered considerable research interest; however, they still face challenges in effectively handling text within images. To address this limitation, researchers have developed two approaches. The first method involves utilizing external Optical Character Recognition (OCR) tools to extract textual information from images, which is then prepended to other textual inputs. The second strategy focuses on employing extremely high-resolution images to improve text recognition capabilities. In this paper, we focus on enhancing the first strategy by introducing a novel method, named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model. TAP-VL employs a lightweight transformer-based OCR module to receive OCR with layout information, compressing it into a short fixed-length sequence for input into the LLM.…
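The compression step described in the abstract can be illustrated with a minimal sketch. It assumes one plausible mechanism: a small set of query vectors cross-attends over layout-aware OCR token features, so the output length is fixed no matter how many OCR tokens the image contains. The abstract does not specify the module's internals, so the single-head attention, the linear box projection, and every name here (`compress_ocr`, `w_box`, `NUM_QUERIES`, `DIM`) are hypothetical, and random arrays stand in for learned parameters.

```python
import numpy as np

NUM_QUERIES = 8   # assumed length of the compressed sequence (hypothetical value)
DIM = 16          # assumed embedding width (hypothetical value)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_ocr(token_emb, boxes, queries, w_box):
    """Compress n OCR tokens into a fixed number of vectors.

    token_emb: (n, d) embeddings of the OCR words
    boxes:     (n, 4) normalized bounding boxes (x0, y0, x1, y1)
    queries:   (k, d) query vectors (learned in a real model, random here)
    w_box:     (4, d) projection of box coordinates into the embedding space
    Returns a (k, d) array, independent of n.
    """
    # Add layout information to each token by projecting its box coordinates.
    keys = token_emb + boxes @ w_box
    # Single-head scaled dot-product cross-attention: queries attend to tokens.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ keys


rng = np.random.default_rng(0)
queries = rng.normal(size=(NUM_QUERIES, DIM))
w_box = rng.normal(size=(4, DIM))

# Two images with different amounts of text yield sequences of the same length.
short_doc = compress_ocr(rng.normal(size=(5, DIM)), rng.random((5, 4)), queries, w_box)
long_doc = compress_ocr(rng.normal(size=(300, DIM)), rng.random((300, 4)), queries, w_box)
print(short_doc.shape, long_doc.shape)
```

Under these assumptions, the number of vectors handed to the LLM is set by the number of queries rather than by the amount of text in the image, which is the property the abstract refers to as a short fixed-length sequence.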
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
