# The meaning of "most" for visual question answering models

**Authors:** Alexander Kuhnle, Ann Copestake

arXiv: 1812.11737 · 2019-06-05

## TL;DR

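This paper investigates how visual question answering models interpret the quantifier "most". Experiments with the FiLM model indicate that a form of approximate number system emerges, whose performance declines on more difficult scenes as predicted by Weber's law and is impeded by confounding factors such as the spatial arrangement of the scene.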

## Contribution

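Using data designed to replicate psycholinguistic experiments previously run on humans, the paper tests which strategy the FiLM visual question answering model learns for interpreting "most". The results indicate a form of approximate number system, and the paper identifies confounding factors, such as spatial arrangement, that impede it.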

## Key findings

- Performance of the FiLM model declines on more difficult scenes in the way Weber's law predicts
- A form of approximate number system emerges in the model when it is trained on "most" questions
- Confounding factors, such as the spatial arrangement of the scene, impede the effectiveness of this system
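
To make the Weber's law finding concrete, the sketch below uses the Gaussian discrimination model common in the psychophysics literature on approximate number systems. It is an illustration, not the paper's evaluation code: the function name `ans_accuracy` and the Weber fraction `w = 0.3` are assumptions chosen for the example. The point it shows is that predicted accuracy depends on the ratio of the two counts, not on their absolute difference.

```python
import math


def ans_accuracy(n1: int, n2: int, w: float = 0.3) -> float:
    """Predicted probability of correctly judging which of two positive
    counts is larger, under a Gaussian approximate-number-system model.

    w is the Weber fraction (hypothetical value, chosen for illustration).
    """
    # Each count is represented as a Gaussian with mean n and std w * n,
    # so the difference has std w * sqrt(n1^2 + n2^2).
    z = abs(n1 - n2) / (math.sqrt(2) * w * math.hypot(n1, n2))
    return 1.0 - 0.5 * math.erfc(z)


# Same ratio (2:1) gives the same prediction; ratios closer to 1 are harder.
for a, b in [(8, 4), (16, 8), (8, 6), (8, 7)]:
    print(f"{a} vs {b}: predicted accuracy {ans_accuracy(a, b):.3f}")
```

Under this model, 8 versus 4 and 16 versus 8 are equally easy, while 8 versus 7 is harder than 8 versus 6, which is the qualitative pattern "declines with more difficult scenes" refers to.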

## Abstract

The correct interpretation of quantifier statements in the context of a visual scene requires non-trivial inference mechanisms. For the example of "most", we discuss two strategies which rely on fundamentally different cognitive concepts. Our aim is to identify what strategy deep learning models for visual question answering learn when trained on such questions. To this end, we carefully design data to replicate experiments from psycholinguistics where the same question was investigated for humans. Focusing on the FiLM visual question answering model, our experiments indicate that a form of approximate number system emerges whose performance declines with more difficult scenes as predicted by Weber's law. Moreover, we identify confounding factors, like spatial arrangement of the scene, which impede the effectiveness of this system.
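
The abstract does not name the two strategies in this summary. A contrast often drawn in the psycholinguistics literature is between comparing cardinalities and pairing items one-to-one, and the sketch below assumes that contrast. Both functions are hypothetical illustrations of how "most A are B" could be verified exactly; neither is the paper's code, and a learned model would only approximate either one.

```python
from typing import Sequence


def most_by_cardinality(n_target: int, n_other: int) -> bool:
    """Count both sets, then compare the two numbers."""
    return n_target > n_other


def most_by_pairing(targets: Sequence[str], others: Sequence[str]) -> bool:
    """Pair each target item with one non-target item, without counting.

    'Most' holds if target items are left over once the other set is used up.
    """
    remaining_targets = list(targets)
    remaining_others = list(others)
    while remaining_targets and remaining_others:
        remaining_targets.pop()
        remaining_others.pop()
    return bool(remaining_targets)


# "Most squares are red": 5 red squares, 3 squares of other colours.
red = ["red"] * 5
not_red = ["blue", "green", "blue"]
print(most_by_cardinality(len(red), len(not_red)))
print(most_by_pairing(red, not_red))
```

The two procedures always agree on the answer, so telling them apart requires looking at how accuracy changes with scene properties such as the ratio of the two sets or their spatial arrangement, which is the approach the abstract describes.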

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.11737/full.md

## Figures

23 figures with captions in the complete paper: https://tomesphere.com/paper/1812.11737/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/1812.11737/full.md

---
Source: https://tomesphere.com/paper/1812.11737