Towards Fully Bilingual Deep Language Modeling
Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter

TL;DR
This paper shows that a fully bilingual deep language model can be pre-trained for two remotely related languages (Finnish and English) without sacrificing performance in either. With an enlarged vocabulary, the bilingual model achieves results comparable to the corresponding monolingual models.
Contribution
The study shows that a bilingual BERT model can match monolingual performance on both languages, challenging the notion that multilinguality necessarily reduces monolingual effectiveness.
Findings
Bilingual BERT performs on par with English BERT on GLUE.
The bilingual model nearly matches monolingual Finnish BERT on Finnish NLP tasks.
Increasing the vocabulary size lets a single model learn both languages effectively.
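One way to see why vocabulary size matters is to look at how a subword tokenizer splits words. The sketch below is a toy illustration, not the authors' code: it implements the greedy longest-match-first segmentation used by WordPiece-style tokenizers and applies it with two hypothetical vocabularies. The vocabulary contents, the names `small_vocab` and `large_vocab`, and the example words are assumptions chosen for illustration only.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word.

    Non-initial pieces carry the conventional "##" prefix.
    Returns [unk] if some part of the word cannot be matched.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical English-centric vocabulary: Finnish words fall back to
# single-character pieces.
small_vocab = {"house", "t", "##a", "##l", "##o", "##s"}

# Hypothetical enlarged vocabulary with room for Finnish subwords as well.
large_vocab = small_vocab | {"talo", "##ssa"}

for name, vocab in [("small", small_vocab), ("large", large_vocab)]:
    for word in ["house", "talossa"]:  # "talossa" is Finnish for "in the house"
        print(name, word, wordpiece_tokenize(word, vocab))
```

With the smaller vocabulary the Finnish word breaks into single characters, while the enlarged vocabulary covers it with two pieces and leaves the English word unchanged. Long, fragmented token sequences are one plausible reason a shared vocabulary that is too small could hurt one of the languages.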
Abstract
Language models based on deep neural networks have facilitated great advances in natural language processing and understanding tasks in recent years. While models covering a large number of languages have been introduced, their multilinguality has come at a cost in terms of monolingual performance, and the best-performing models at most tasks not involving cross-lingual transfer remain monolingual. In this paper, we consider the question of whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance at either language. We collect pre-training data, create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models. Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Adam · Layer Normalization · Dense Connections · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay
