TxT: Crossmodal End-to-End Learning with Transformers

Jan-Martin O. Steitz; Jonas Pfeiffer; Iryna Gurevych; Stefan Roth

arXiv:2109.04422 · cs.CV · September 10, 2021

TL;DR

TxT is a transformer-based crossmodal pipeline that fine-tunes both its language and visual components end-to-end on the downstream task. It targets multimodal reasoning tasks such as Visual Question Answering, replacing the fixed, pre-extracted object-detector features that most current pipelines rely on.

Contribution

The paper presents a novel transformer-based crossmodal architecture that enables fully end-to-end training and fine-tuning of both language and visual modules for multimodal reasoning.

Findings

01 Significant performance improvements in multimodal question answering.

02 Effective integration of global context in transformer-based visual detectors.

03 Scalability enhancements for multimodal reasoning models.

Abstract

Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today's multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today's multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning…
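The contrast the abstract draws between fixed, pre-extracted visual features and end-to-end fine-tuning can be illustrated with a deliberately tiny sketch. This is not the TxT architecture: the scalar "visual encoder" weight `v`, the scalar "task head" weight `h`, and the `train` function are all hypothetical names chosen for illustration. The sketch only shows that a frozen encoder never receives task gradients, while an end-to-end setup updates both parts.

```python
# Toy illustration only (assumption: a scalar model, not the TxT pipeline).
# Prediction = h * (v * x), where v stands in for a visual encoder and
# h stands in for the downstream task head.

def train(x, target, v, h, lr=0.1, steps=500, freeze_visual=True):
    """Plain gradient descent on the squared error (h * v * x - target)^2."""
    for _ in range(steps):
        feature = v * x               # "visual feature" for input x
        pred = h * feature            # task head applied to the feature
        err = pred - target
        grad_h = 2 * err * feature    # gradient w.r.t. the task head
        grad_v = 2 * err * h * x      # gradient w.r.t. the visual encoder
        h -= lr * grad_h
        if not freeze_visual:         # end-to-end: the encoder is tuned too
            v -= lr * grad_v
    return v, h

# Fixed features: the encoder weight stays at its initial value.
v_frozen, h_frozen = train(1.0, 2.0, v=0.5, h=0.5, freeze_visual=True)

# End-to-end: the encoder weight is adapted to the task as well.
v_e2e, h_e2e = train(1.0, 2.0, v=0.5, h=0.5, freeze_visual=False)

print("frozen encoder:", v_frozen, "head:", h_frozen)
print("end-to-end encoder:", v_e2e, "head:", h_e2e)
```

In both runs the prediction fits the target, but only in the end-to-end run does the encoder weight move away from its initial value. In a real pipeline the analogous choice is whether the detector's parameters are included in the optimizer.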

Taxonomy

Methods: RoIPool · Convolution · Region Proposal Network · Softmax · Faster R-CNN