TxT: Crossmodal End-to-End Learning with Transformers
Jan-Martin O. Steitz, Jonas Pfeiffer, Iryna Gurevych, Stefan Roth

TL;DR
TxT is a transformer-based crossmodal pipeline that fine-tunes its language and visual components jointly, end to end, on the downstream task. It replaces the pre-extracted, fixed object-detector features that most multimodal pipelines rely on, with the aim of improving reasoning in tasks such as Visual Question Answering (VQA).
Contribution
The paper presents a novel transformer-based crossmodal architecture that enables fully end-to-end training and fine-tuning of both language and visual modules for multimodal reasoning.
Findings
Significant performance improvements in multimodal question answering.
Effective integration of global context in transformer-based visual detectors.
Scalability enhancements for multimodal reasoning models.
Abstract
Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today's multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today's multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning…
Taxonomy
Methods: RoIPool · Convolution · Region Proposal Network · Softmax · Faster R-CNN
