TL;DR
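For domain-specific language modeling with 10-100 million tokens of in-domain data and limited compute, initializing input embeddings from in-domain data and then freezing them improves language model performance, largely by giving rare words a useful representation. With this initialization, the standard practice of tying input and output embeddings does not improve perplexity.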
Contribution
It proposes initializing input embeddings from in-domain data and freezing them during language model training in a low-data, low-compute setting, and shows that the benefit holds across several English-language domains.
Findings
In-domain embedding initialization improves perplexity.
Freezing embeddings benefits rare word representation.
Tying input and output embeddings does not improve perplexity with in-domain initialization.
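The findings above are stated in terms of perplexity, which is the exponential of the average per-token negative log-likelihood (lower is better). The sketch below only illustrates that definition; the function name and the per-token probabilities are made up for illustration and are not taken from the paper.

```python
import math


def perplexity(token_probs):
    """Return exp of the mean negative log-probability over the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities that two models assign to the same four tokens.
baseline_probs = [0.25, 0.25, 0.25, 0.25]
improved_probs = [0.5, 0.5, 0.5, 0.5]

print("baseline perplexity:", perplexity(baseline_probs))
print("improved perplexity:", perplexity(improved_probs))
```

A model that assigns higher probability to the observed tokens gets a lower perplexity, which is the sense in which the findings report an improvement.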
Abstract
Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initializing and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initializing with embeddings trained on in-domain data.
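The setup the abstract describes (input embeddings initialized from in-domain vectors and frozen, with an output layer that is not tied to them) can be sketched in PyTorch roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `FrozenInputLM`, the LSTM body, all sizes, and the random matrix standing in for pretrained in-domain embeddings are hypothetical.

```python
import torch
import torch.nn as nn


class FrozenInputLM(nn.Module):
    """Toy language model with frozen input embeddings and an untied output layer."""

    def __init__(self, pretrained, hidden_size=64):
        super().__init__()
        vocab_size, emb_dim = pretrained.shape
        # Input embeddings start from the given vectors and receive no updates.
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=True)
        # Assumption: an LSTM body; the source text does not name an architecture.
        self.rnn = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        # Output projection has its own trainable weights (not tied to self.embed).
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)


torch.manual_seed(0)
vocab_size, emb_dim = 100, 16
# Stand-in for embeddings trained on in-domain text (random here).
pretrained = torch.randn(vocab_size, emb_dim)

model = FrozenInputLM(pretrained)
# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

# One training step on random token ids: predict each next token.
tokens = torch.randint(0, vocab_size, (4, 11))
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()
optimizer.step()
```

Because the embedding table is frozen, vectors for rare words stay at their initial in-domain values instead of being updated from the few training examples that contain them.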
