TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models

Jonathan Fhima; Elad Ben Avraham; Oren Nuriel; Yair Kittenplon; Roy Ganz; Aviad Aberdam; Ron Litman

arXiv:2411.04642·cs.CV·November 8, 2024

Open Access

TL;DR

TAP-VL introduces a novel approach that treats OCR-derived text as a separate modality, enhancing vision-language models' ability to understand text within images through a lightweight transformer-based integration.

Contribution

The paper proposes TAP-VL, a new method that integrates OCR information as a distinct modality into VL models, improving text understanding in images.

Findings

1. Consistent performance improvements across multiple VL benchmarks.

2. Effective integration of OCR as a separate modality enhances text comprehension.

3. Pretraining of OCR module on unlabeled data benefits downstream VL tasks.

Abstract

Vision-Language (VL) models have garnered considerable research interest; however, they still face challenges in effectively handling text within images. To address this limitation, researchers have developed two approaches. The first method involves utilizing external Optical Character Recognition (OCR) tools to extract textual information from images, which is then prepended to other textual inputs. The second strategy focuses on employing extremely high-resolution images to improve text recognition capabilities. In this paper, we focus on enhancing the first strategy by introducing a novel method, named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model. TAP-VL employs a lightweight transformer-based OCR module to receive OCR with layout information, compressing it into a short fixed-length sequence for input into the LLM.…
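
The compression step described in the abstract (variable-length OCR with layout in, short fixed-length sequence out) can be illustrated with a minimal sketch. This is not the authors' architecture: the abstract only states that the module is lightweight and transformer-based, so the sketch assumes a single layer of cross-attention from a fixed set of query vectors, in the style of learned-query resamplers. All names and sizes here (`DIM`, `NUM_QUERIES`, `embed_word`, `embed_box`, `compress`) are hypothetical, and the weights are random rather than trained.

```python
import math
import random

random.seed(0)

DIM = 8          # hypothetical embedding width (assumption, not from the paper)
NUM_QUERIES = 4  # hypothetical fixed output length (assumption, not from the paper)


def rand_matrix(rows, cols):
    """Random stand-in for weights that would normally be learned."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


W_LAYOUT = rand_matrix(4, DIM)           # projects (x0, y0, x1, y1) to DIM
QUERIES = rand_matrix(NUM_QUERIES, DIM)  # one query per output slot


def embed_word(word):
    """Toy text embedding: bucket characters into DIM counters."""
    vec = [0.0] * DIM
    for i, ch in enumerate(word):
        vec[(ord(ch) + i) % DIM] += 1.0
    return vec


def embed_box(box):
    """Linear projection of a bounding box normalized to [0, 1]."""
    return [sum(box[k] * W_LAYOUT[k][d] for k in range(4)) for d in range(DIM)]


def embed_ocr(tokens):
    """Sum text and layout embeddings for each (word, box) OCR token."""
    return [
        [t + l for t, l in zip(embed_word(word), embed_box(box))]
        for word, box in tokens
    ]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(ocr_embeddings):
    """Single-head cross-attention: each query attends over all OCR tokens.

    The output always has NUM_QUERIES rows, however many tokens come in.
    Assumes at least one OCR token; an empty input is not handled.
    """
    compressed = []
    for query in QUERIES:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM)
            for key in ocr_embeddings
        ]
        weights = softmax(scores)
        compressed.append(
            [
                sum(w * key[d] for w, key in zip(weights, ocr_embeddings))
                for d in range(DIM)
            ]
        )
    return compressed


short_doc = [("Total", (0.1, 0.8, 0.3, 0.85)), ("$42.00", (0.7, 0.8, 0.9, 0.85))]
long_doc = short_doc * 10  # 20 OCR tokens

compressed_short = compress(embed_ocr(short_doc))
compressed_long = compress(embed_ocr(long_doc))
print(len(compressed_short), len(compressed_long))
```

Both inputs yield the same number of output vectors, which is the property that lets a compressed OCR sequence occupy a fixed, small number of positions in the LLM input regardless of how much text the image contains.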

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Focus