Towards Models that Can See and Read

Roy Ganz; Oren Nuriel; Aviad Aberdam; Yair Kittenplon; Shai Mazor; Ron Litman

arXiv:2301.07389 · cs.CV · March 22, 2023

TL;DR

This paper introduces UniTNT, a unified model that enables vision-language architectures to understand scene-text, improving performance on VQA and Image Captioning tasks by integrating text as an additional modality.

Contribution

The paper presents UniTNT, the first unified model that combines scene-text understanding with existing multimodal architectures for VQA and Captioning.

Findings

01. UniTNT is a single model that successfully handles both the general and the scene-text versions of VQA and Captioning.

02. Scene-text understanding improves VQA performance by up to 2.69%.

03. Scene-text understanding improves Captioning by up to 0.6 CIDEr.

Abstract

Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning