MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Yixin Nie, Linjie Li, Zhe Gan, Shuohang Wang, Chenguang Zhu, Michael Zeng, Zicheng Liu, Mohit Bansal, Lijuan Wang

TL;DR
This paper empirically investigates the use of MLP architectures for vision-and-language fusion, demonstrating that with pre-training and slight modifications, MLPs can perform comparably to transformers and even outperform them in some pre-trained scenarios.
Contribution
The paper presents the first comprehensive empirical study of MLPs' effectiveness in VL tasks and explores the potential of all-MLP architectures for VL modeling.
Findings
Pre-training reduces the performance gap between MLPs and transformers.
Adding tiny one-head attention to MLPs achieves comparable results to transformers.
Pre-trained all-MLP models can outperform full-featured transformer models on average.
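The second finding, an MLP fusion block augmented with a tiny one-head attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `TinyAttnMLPBlock`, the helper `fuse`, the layer sizes, the random initialization, and the choice to simply add the attention output to the token-mixing output are all assumptions made for illustration, and layer normalization and channel-mixing layers are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class TinyAttnMLPBlock:
    """Hypothetical fusion block: token-mixing MLP plus one small attention head."""

    def __init__(self, n_tokens, d_model, d_hidden, d_attn, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0.0, 0.02, size=shape)
        # Token-mixing weights act along the token axis, so the
        # sequence length must be fixed in advance.
        self.w_tok1 = init(n_tokens, d_hidden)
        self.w_tok2 = init(d_hidden, n_tokens)
        # One attention head with a small inner width d_attn (assumed much smaller than d_model).
        self.w_qkv = init(d_model, 3 * d_attn)
        self.w_out = init(d_attn, d_model)
        self.d_attn = d_attn

    def token_mix(self, x):
        # x: (n_tokens, d_model); transpose so the MLP mixes information across tokens
        h = gelu(x.T @ self.w_tok1)        # (d_model, d_hidden)
        return (h @ self.w_tok2).T         # (n_tokens, d_model)

    def tiny_attention(self, x):
        q, k, v = np.split(x @ self.w_qkv, 3, axis=-1)   # each (n_tokens, d_attn)
        weights = softmax(q @ k.T / np.sqrt(self.d_attn))  # (n_tokens, n_tokens)
        return (weights @ v) @ self.w_out                # (n_tokens, d_model)

    def __call__(self, x):
        # Residual connection around the sum of both mixing paths
        return x + self.token_mix(x) + self.tiny_attention(x)

def fuse(vision_feats, text_feats, block):
    # Concatenate region/patch features and word features into one sequence
    x = np.concatenate([vision_feats, text_feats], axis=0)
    return block(x)

# Usage with made-up sizes: 36 vision tokens, 20 text tokens, width 32
rng = np.random.default_rng(1)
vision = rng.normal(size=(36, 32))
text = rng.normal(size=(20, 32))
block = TinyAttnMLPBlock(n_tokens=56, d_model=32, d_hidden=64, d_attn=8)
fused = fuse(vision, text, block)
print(fused.shape)
```

In this sketch the attention path adds only `d_model * 3 * d_attn + d_attn * d_model` weights, which stays small when `d_attn` is small relative to `d_model`. That is the sense in which the attention is "tiny" compared with full multi-head attention.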
Abstract
We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion. Through extensive experiments on 5 VL tasks and 5 robust VQA benchmarks, we find that: (i) Without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers; (ii) However, VL pre-training can help close the performance gap; (iii) Instead of heavy multi-head attention, adding tiny one-head attention to MLPs is sufficient to achieve comparable performance to transformers. Moreover, we also find that the performance gap between MLPs and transformers is not widened when being evaluated on the harder robust VQA benchmarks, suggesting using MLPs for VL fusion can generalize roughly to a similar degree as using transformers. These results hint that MLPs can effectively learn to align vision and text features extracted from lower-level…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
