Multi-stage Pre-training over Simplified Multimodal Pre-training Models
Tongtong Liu, Fangxiang Feng, Xiaojie Wang

TL;DR
This paper introduces a multi-stage pre-training approach for simplified multimodal models. With significantly less pre-training data and a smaller model, it matches the larger baseline on downstream tasks and exceeds it on image-text retrieval.
Contribution
The paper proposes a multi-stage pre-training (MSP) method that trains the model in stages on information at word, phrase, and sentence granularity in both texts and images, reducing data and model size requirements while maintaining performance.
Findings
Achieves performance comparable to LXMERT on downstream tasks.
Outperforms LXMERT on Image-Text Retrieval.
Uses only 11.76% of the original pre-training data.
Abstract
Multimodal pre-training models, such as LXMERT, have achieved excellent results in downstream tasks. However, current pre-trained models require large amounts of training data and have huge model sizes, which make them difficult to apply in low-resource situations. How to obtain performance similar to or even better than that of a larger model with less pre-training data and a smaller model size has become an important problem. In this paper, we propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities, from word and phrase to sentence, in both texts and images to pre-train the model in stages. We also design several different pre-training tasks suited to the information granularity of each stage in order to efficiently capture diverse knowledge from a limited corpus. We take a Simplified LXMERT (LXMERT-S), which has only 45.9%…
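The staged schedule described in the abstract can be illustrated with a minimal sketch. It only shows the ordering idea: the model is trained stage by stage, from word to phrase to sentence granularity, with each stage running its own set of tasks. The `Stage` class, the `run_multistage` function, the task names, and the epoch counts are all hypothetical placeholders and are not taken from the paper, whose actual tasks and settings are not given in this excerpt.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    """One pre-training stage (hypothetical structure, for illustration only)."""
    granularity: str          # "word", "phrase", or "sentence"
    tasks: Tuple[str, ...]    # placeholder task labels, not the paper's task names
    epochs: int               # placeholder value, not the paper's setting


# Stages ordered from fine to coarse granularity, as the abstract describes.
STAGES = (
    Stage("word", ("word_task",), epochs=1),
    Stage("phrase", ("phrase_task",), epochs=1),
    Stage("sentence", ("sentence_task",), epochs=1),
)


def run_multistage(
    stages: Tuple[Stage, ...],
    train_step: Callable[[str, str], None],
) -> List[Tuple[str, int, str]]:
    """Run every stage in order and return a log of (granularity, epoch, task)."""
    log: List[Tuple[str, int, str]] = []
    for stage in stages:
        for epoch in range(stage.epochs):
            for task in stage.tasks:
                # train_step stands in for one pass of real training on this task.
                train_step(stage.granularity, task)
                log.append((stage.granularity, epoch, task))
    return log


if __name__ == "__main__":
    # A dummy train_step that does nothing, just to show the stage ordering.
    for entry in run_multistage(STAGES, lambda granularity, task: None):
        print(entry)
```

In a real setup, `train_step` would be replaced by a function that updates the model on the given task; the point of the sketch is only that a later stage starts from the weights left by the earlier one.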
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Learning Cross-Modality Encoder Representations from Transformers
