MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model

Rasmus Kær Jørgensen; Mareike Hartmann; Xiang Dai; Desmond Elliott

arXiv:2109.06605·cs.CL·September 15, 2021

TL;DR

This paper investigates how to effectively adapt a single multilingual language model to specific domains across multiple languages, demonstrating that such models can outperform general multilingual models and approach monolingual performance.

Contribution

The paper introduces techniques for domain adaptive pretraining in a multilingual setting, so that a single model can handle domain-specific tasks across multiple languages.

Findings

01. Multilingual domain-adaptive models outperform general multilingual models.
02. A single multilingual model performs close to monolingual models on domain-specific tasks.
03. These results hold for both adapter-based and full-model pretraining.

Abstract

Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose…

Code & Models

Repositories

rasmuskaer/mdapt_supplements (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification