# Towards VQA Models That Can Read

**Authors:** Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach

arXiv: 1904.08920 · 2019-05-15

## TL;DR

This paper introduces the TextVQA dataset and a novel model, LoRRA, to enable Visual Question Answering systems to read and reason about text in images, addressing a key limitation of current models.

## Contribution

The paper contributes TextVQA, a dataset of 45,336 questions on 28,408 images that require reasoning about text to answer, and LoRRA, a model architecture that reads text in an image, reasons about it in the context of the image and the question, and predicts an answer.

## Key findings

- LoRRA outperforms existing VQA models on TextVQA.
- The gap between human and machine performance is significantly larger on TextVQA than on VQA 2.0.
- TextVQA is well suited to benchmarking progress on reading and reasoning about text in images, a direction complementary to VQA 2.0.

## Abstract

Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.
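The abstract says a predicted answer may be either a deduction or "composed of the strings found in the image". One way to picture this is a scorer that ranks a fixed answer vocabulary together with the OCR tokens read from the current image. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `pick_answer`, the flat score list, and the example values are all hypothetical.

```python
from typing import Sequence


def pick_answer(
    vocab: Sequence[str],
    ocr_tokens: Sequence[str],
    scores: Sequence[float],
) -> str:
    """Return the highest-scoring entry of a combined answer space.

    Assumption (hypothetical, for illustration only): the answer space is
    the fixed vocabulary followed by the OCR tokens found in the image,
    and `scores` holds one score per entry in that order.
    """
    answer_space = list(vocab) + list(ocr_tokens)
    if len(scores) != len(answer_space):
        raise ValueError("need exactly one score per candidate answer")
    # Index of the best-scoring candidate; ties resolve to the first one.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return answer_space[best]


# Hypothetical example: the OCR token "STOP" receives the highest score,
# so the answer is copied from the image text rather than the vocabulary.
answer = pick_answer(
    vocab=["yes", "no", "red"],
    ocr_tokens=["STOP", "Main"],
    scores=[0.10, 0.20, 0.05, 0.90, 0.30],
)
print(answer)  # STOP
```

When the image contains no readable text, `ocr_tokens` is empty and the choice falls back to the fixed vocabulary alone.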

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.08920/full.md

## Figures

43 figures with captions in the complete paper: https://tomesphere.com/paper/1904.08920/full.md

## References

51 references — full list in the complete paper: https://tomesphere.com/paper/1904.08920/full.md

---
Source: https://tomesphere.com/paper/1904.08920