Multi-stage Pre-training over Simplified Multimodal Pre-training Models

Tongtong Liu; Fangxiang Feng; Xiaojie Wang

arXiv:2107.14596·cs.CL·August 2, 2021

Open Access · 1 Repo

TL;DR

This paper introduces a multi-stage pre-training approach for simplified multimodal models, enabling comparable or better performance with significantly less data and smaller models, especially in image-text retrieval tasks.

Contribution

The paper proposes a novel multi-stage pre-training method that leverages information at multiple granularities, reducing data and model size requirements while maintaining high performance.

Findings

01. Achieves comparable performance to LXMERT on downstream tasks.

02. Outperforms LXMERT in Image-Text Retrieval.

03. Uses only 11.76% of original pre-training data.

Abstract

Multimodal pre-training models, such as LXMERT, have achieved excellent results in downstream tasks. However, current pre-trained models require large amounts of training data and have huge model sizes, which makes them difficult to apply in low-resource situations. How to obtain performance similar to, or even better than, that of a larger model with less pre-training data and a smaller model size has become an important problem. In this paper, we propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities, from word and phrase to sentence, in both texts and images to pre-train the model in stages. We also design several different pre-training tasks suited to the information granularity of each stage in order to efficiently capture diverse knowledge from a limited corpus. We take a Simplified LXMERT (LXMERT-S), which has only 45.9%…
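
The staged idea in the abstract (pre-train on word-level, then phrase-level, then sentence-level information) can be illustrated with a minimal scheduling sketch. This is not the authors' implementation: the `Stage` class, the `run_stages` helper, the task labels, and the epoch counts are all hypothetical placeholders, and the training step is a stub supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str         # granularity of this stage: "word", "phrase", or "sentence"
    tasks: List[str]  # hypothetical task labels, not the paper's actual tasks
    epochs: int       # hypothetical epoch count for this stage


def run_stages(stages: List[Stage], train_one_epoch: Callable[[Stage], None]) -> List[str]:
    """Run the stages strictly in order and return a log of 'stage:epoch' entries."""
    log = []
    for stage in stages:
        for epoch in range(stage.epochs):
            # The caller decides what one epoch of training on this stage's tasks means.
            train_one_epoch(stage)
            log.append(f"{stage.name}:{epoch}")
    return log


# Assumed ordering from fine to coarse granularity, as described in the abstract.
STAGES = [
    Stage("word", ["word_level_task"], epochs=2),
    Stage("phrase", ["phrase_level_task"], epochs=1),
    Stage("sentence", ["sentence_level_task"], epochs=1),
]

calls: List[str] = []
log = run_stages(STAGES, lambda stage: calls.append(stage.name))
```

In a real setup the stub would be replaced by a training loop that optimizes the losses of the current stage's tasks, with the model weights carried over from one stage to the next.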

Code & Models

Repositories

lttsmn/LXMERT-S
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Learning Cross-Modality Encoder Representations from Transformers