# Scene Text Visual Question Answering

**Authors:** Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C.V. Jawahar, Dimosthenis Karatzas

arXiv: 1905.13648 · 2019-10-17

## TL;DR

This paper introduces the ST-VQA dataset to emphasize the importance of scene text understanding in visual question answering, proposing new tasks, evaluation metrics, and baseline methods to advance research in this area.

## Contribution

The paper presents a new dataset, ST-VQA, with tasks requiring scene text comprehension, along with a novel evaluation metric and baseline methods for scene text VQA.

## Key findings

- The ST-VQA dataset highlights the significance of scene text in VQA.
- A new evaluation metric accounts for both reasoning errors and shortcomings of the text recognition module.
- Baseline methods provide insights into scene text VQA challenges.
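
This summary does not spell out how the metric is computed. As a rough illustration of how a single score can tolerate small text recognition slips while still penalizing wrong answers, the sketch below assumes a thresholded normalized edit-distance similarity, in the spirit of the Average Normalized Levenshtein Similarity (ANLS) associated with ST-VQA. The function names, the 0.5 threshold, and the lowercasing step are assumptions made for this example, not details taken from the summary.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def answer_score(prediction: str, ground_truths: Sequence[str],
                 threshold: float = 0.5) -> float:
    """Best similarity of a prediction against any accepted answer.

    Assumption: similarity is 1 - normalized edit distance, and it is set
    to 0 once the normalized distance reaches the threshold, so near
    misses (likely OCR slips) earn partial credit while unrelated answers
    (likely reasoning errors) earn none.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for gt in ground_truths:
        ref = gt.strip().lower()
        longest = max(len(pred), len(ref))
        nl = levenshtein(pred, ref) / longest if longest else 0.0
        score = 1.0 - nl if nl < threshold else 0.0
        best = max(best, score)
    return best


def mean_score(predictions: Sequence[str],
               ground_truths: Sequence[List[str]],
               threshold: float = 0.5) -> float:
    """Average per-question score over a set of questions."""
    scores = [answer_score(p, g, threshold)
              for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, a prediction such as `"coca co1a"` against the answer `"coca cola"` keeps most of its credit because only one character differs, whereas `"pepsi"` scores zero because its normalized distance exceeds the threshold.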

## Abstract

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition we put forward a series of baseline methods, which provide further insight to the newly released dataset, and set the scene for further research.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.13648/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1905.13648/full.md

## References

60 references — full list in the complete paper: https://tomesphere.com/paper/1905.13648/full.md

---
Source: https://tomesphere.com/paper/1905.13648