Headless Language Models: Learning without Predicting with Contrastive Weight Tying
Nathan Godey, Éric de la Clergerie, Benoît Sagot

TL;DR
This paper introduces Contrastive Weight Tying, a novel pretraining method for headless language models that reconstructs embeddings contrastively, reducing computational costs and improving downstream task performance.
Contribution
It presents a contrastive pretraining approach for headless language models that lowers training compute by up to 20 times and improves GLUE and LAMBADA results over classical LMs at similar compute budgets.
Findings
Training compute requirements are reduced by up to 20 times.
GLUE score improves by +1.6 over classical LMs within similar compute budgets.
LAMBADA accuracy improves by +2.7 over classical LMs within similar compute budgets.
Abstract
Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages, substantially reducing training computational requirements by up to 20 times, while simultaneously enhancing downstream performance and data efficiency. We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
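To make "reconstructing input embeddings in a contrastive fashion" concrete, here is a minimal sketch of one way such an objective could look. It is not the authors' implementation. It assumes an InfoNCE-style loss with dot-product similarity, where the negatives for each position are the input embeddings of the other target tokens in the same batch. The function name `contrastive_weight_tying_loss` and the toy data are hypothetical.

```python
import math

def contrastive_weight_tying_loss(outputs, embeddings, target_ids):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's code).

    outputs:    list of output vectors, one per position
    embeddings: input embedding table (list of vectors, indexed by token id)
    target_ids: token id each output vector should reconstruct
    """
    # Candidates are the *input* embeddings of the batch's target tokens,
    # so no vocabulary-sized output projection (language modeling head) is used.
    candidates = [embeddings[t] for t in target_ids]
    total = 0.0
    for i, out in enumerate(outputs):
        # Dot-product similarity between this output and every candidate.
        scores = [sum(a * b for a, b in zip(out, c)) for c in candidates]
        # Numerically stable log-sum-exp over the candidates.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of picking the matching candidate.
        total += log_norm - scores[i]
    return total / len(outputs)

# Toy usage: 4-token vocabulary with one-hot embeddings, 3 positions.
# Duplicate target tokens in a batch are not handled in this sketch.
emb = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
targets = [0, 2, 3]
aligned = [[10.0 * x for x in emb[t]] for t in targets]  # outputs match targets
uninformative = [[0.0] * 4 for _ in targets]             # all candidates tie

print(contrastive_weight_tying_loss(aligned, emb, targets))
print(contrastive_weight_tying_loss(uninformative, emb, targets))
```

In this sketch the normalization runs over the candidates in the batch rather than over the full vocabulary, which is the kind of saving the abstract alludes to when it contrasts the method with predicting probability distributions over extensive token vocabularies.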
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Weight Tying
