Towards Models that Can See and Read
Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, Ron Litman

TL;DR
This paper introduces UniTNT, a unified model that enables vision-language architectures to understand scene-text, improving performance on VQA and Image Captioning tasks by integrating text as an additional modality.
Contribution
The paper presents UniTNT, the first unified model that combines scene-text understanding with existing multimodal architectures for VQA and Captioning.
Findings
A single UniTNT model handles both task types: the general and the scene-text versions of VQA and Captioning.
Scene-text understanding improves VQA performance by up to 2.69%.
Scene-text understanding improves Captioning by up to 0.6 CIDEr.
Abstract
Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding…
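The idea of treating scene-text as an additional modality can be illustrated with a small sketch. This is not the paper's implementation: the function name `fuse_scene_text`, the `gate` parameter, and the choice of a gated residual cross-attention are all assumptions made for illustration, and the abstract does not specify what the designated fusion modules look like. Each visual token attends over the scene-text (OCR) token embeddings and adds the result back residually, so that an image without text leaves the pretrained features unchanged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_scene_text(visual_tokens, text_tokens, gate=0.5):
    """Hypothetical fusion step (an assumption, not UniTNT's actual module).

    Each visual token attends over the scene-text tokens with scaled
    dot-product attention, and the attended context is added back to the
    visual token, scaled by `gate`.
    """
    if not visual_tokens or not text_tokens:
        # No scene text (or no visual tokens): features pass through unchanged.
        return [list(v) for v in visual_tokens]

    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for v in visual_tokens:
        # Attention weights of this visual token over all scene-text tokens.
        weights = softmax([dot(v, t) * scale for t in text_tokens])
        # Weighted average of the scene-text embeddings.
        context = [
            sum(w * t[i] for w, t in zip(weights, text_tokens))
            for i in range(dim)
        ]
        # Gated residual connection keeps the pretrained features intact.
        fused.append([vi + gate * ci for vi, ci in zip(v, context)])
    return fused


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]   # toy visual token embeddings
    ocr = [[1.0, 0.0]]                  # toy scene-text token embedding
    print(fuse_scene_text(visual, ocr, gate=0.5))
```

The residual form is one plausible way to attach a new modality to a pretrained encoder-decoder without disturbing it: with `gate=0` or an empty scene-text input, the output equals the original visual tokens.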
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
