# Fusion of Detected Objects in Text for Visual Question Answering

**Authors:** Chris Alberti, Jeffrey Ling, Michael Collins, David Reitter

arXiv: 1908.05054 · 2019-11-05

## TL;DR

This paper introduces B2T2 ("Bounding Boxes in Text Transformer"), a neural architecture that combines vision and natural language in a single unified model. It sets a new state of the art on the Visual Commonsense Reasoning benchmark by integrating visual features early in the text analysis.

## Contribution

The paper presents B2T2, a single unified architecture that uses referential information binding words to portions of the image, and reports improved results on the Visual Commonsense Reasoning benchmark.

## Key findings

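- Achieved a new state of the art on the Visual Commonsense Reasoning benchmark, with the best performance on the public leaderboard as of May 22, 2019.
- Reduced the error rate by 25% (relative) compared to published baselines.
- A detailed ablation analysis shows that early integration of the visual features into the text analysis is key to the architecture's effectiveness.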
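The 25% figure above is a relative reduction in error rate, not an absolute drop in percentage points. A minimal sketch of the distinction follows; the function name and the example error rates are hypothetical and are not results from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error that the new model removes.

    Both arguments are error rates in [0, 1], i.e. 1 - accuracy.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers, chosen only to illustrate the arithmetic:
# a baseline with 40% error and a new model with 30% error.
# The absolute drop is 10 points, while the relative reduction is
# (0.40 - 0.30) / 0.40, about one quarter of the baseline error.
reduction = relative_error_reduction(0.40, 0.30)
print(f"relative reduction: {reduction:.0%}")
```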

## Abstract

To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language. The "Bounding Boxes in Text Transformer" (B2T2) also leverages referential information binding words to portions of the image in a single unified architecture. B2T2 is highly effective on the Visual Commonsense Reasoning benchmark (https://visualcommonsense.com), achieving a new state-of-the-art with a 25% relative reduction in error rate compared to published baselines and obtaining the best performance to date on the public leaderboard (as of May 22, 2019). A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture. A reference implementation of our models is provided (https://github.com/google-research/language/tree/master/language/question_answering/b2t2).
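The idea of early integration described in the abstract can be illustrated with a small NumPy sketch: features of the image region bound to a word are projected into the text embedding space and added to that word's embedding before any text encoder runs. This is an illustration under stated assumptions, not the authors' reference implementation. The function name, the argument names, the single linear projection, and the convention of `-1` for "no bound region" are all hypothetical.

```python
import numpy as np


def early_fusion_embeddings(token_emb, region_feats, token_to_region, proj):
    """Add projected region features to the tokens that refer to a region.

    token_emb:       (T, d) float array of text token embeddings.
    region_feats:    (R, v) float array of visual features, one row per box.
    token_to_region: (T,) int array; index of the bound box, or -1 for none
                     (hypothetical convention for this sketch).
    proj:            (v, d) matrix mapping visual features to text space
                     (assumed to be a learned parameter in a real model).
    """
    fused = token_emb.copy()
    has_box = token_to_region >= 0
    # Look up the bound region for each referring token and project it.
    visual = region_feats[token_to_region[has_box]] @ proj
    fused[has_box] += visual
    return fused


# Usage with random placeholder data: 4 tokens, 2 boxes.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 3))
regions = rng.normal(size=(2, 5))
projection = rng.normal(size=(5, 3))
binding = np.array([-1, 0, -1, 1])  # tokens 1 and 3 refer to boxes 0 and 1
fused = early_fusion_embeddings(tokens, regions, binding, projection)
print(fused.shape)
```

The fused embeddings would then be fed to the text encoder, so every layer of the encoder sees the visual signal. A late-fusion design would instead combine text and image representations only after the text has been encoded.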

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1908.05054/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/1908.05054/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1908.05054/full.md

---
Source: https://tomesphere.com/paper/1908.05054