Supervised Multimodal Bitransformers for Classifying Images and Text
Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Ethan Perez, Davide Testuggine

TL;DR
This paper introduces a supervised multimodal bitransformer that effectively combines text and image data, achieving state-of-the-art results in multimodal classification tasks and surpassing existing models on challenging benchmarks.
Contribution
The paper presents a novel supervised multimodal bitransformer model that fuses text and image encoders for improved multimodal classification performance.
Findings
Achieved state-of-the-art results on multiple multimodal benchmarks.
Outperformed strong baselines on hard test sets.
Demonstrated effectiveness of multimodal fusion in classification tasks.
Abstract
Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Text and Document Classification Technologies
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding · Dense Connections
