Improving Low Compute Language Modeling with In-Domain Embedding Initialisation

Charles Welch; Rada Mihalcea; Jonathan K. Kummerfeld

arXiv:2009.14109·cs.CL·October 1, 2020

TL;DR

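This paper shows that initializing input embeddings from in-domain data and then freezing them improves language model performance when both in-domain data (10-100 million tokens) and compute are limited, mainly by giving rare words a useful representation. It also finds that the standard convention of tying input and output embeddings does not improve perplexity when the embeddings are initialized from in-domain data.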

Contribution

The paper proposes initializing input embeddings from in-domain data and freezing them when training language models with limited data and compute, and shows that the benefit holds across several English-language domains.

Findings

1. In-domain embedding initialization improves perplexity.

2. Freezing embeddings benefits rare word representation.

3. Tying input and output embeddings does not improve perplexity with in-domain initialization.

Abstract

Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initializing with embeddings trained on in-domain data.
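The setup described in the abstract can be sketched roughly as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the `FrozenInputLM` class, the layer sizes, and the random `pretrained` matrix (a stand-in for word vectors trained on in-domain text) are all assumptions made for the example. It shows input embeddings that are initialized from a given matrix and frozen, with a separate, untied output layer.

```python
import torch
import torch.nn as nn


class FrozenInputLM(nn.Module):
    """Hypothetical LSTM language model with frozen, pre-initialized input embeddings."""

    def __init__(self, pretrained: torch.Tensor, hidden_size: int = 64):
        super().__init__()
        vocab_size, emb_dim = pretrained.shape
        # Initialize from the given vectors; freeze=True sets requires_grad=False,
        # so these weights receive no gradient updates during training.
        self.embed = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)
        self.rnn = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        # Untied output layer: its weights are independent of the input embeddings.
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)


torch.manual_seed(0)
vocab_size, emb_dim = 50, 16
# Assumption: random vectors stand in for embeddings trained on in-domain data.
pretrained = torch.randn(vocab_size, emb_dim)

model = FrozenInputLM(pretrained)
# Only pass trainable parameters to the optimizer; the frozen embedding is skipped.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

# One training step on random token ids (next-token prediction).
tokens = torch.randint(0, vocab_size, (4, 11))
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the update step, `model.embed.weight` still equals `pretrained`, while the LSTM and output layer have changed. Because the input embeddings never move, rare words keep whatever representation the in-domain vectors gave them rather than being updated from only a handful of training occurrences.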

Code & Models

Repositories

jkkummerfeld/emnlp20lm (PyTorch, official)
