You should evaluate your language model on marginal likelihood over tokenisations

Kris Cao; Laura Rimell

arXiv:2109.02550·cs.CL·September 22, 2021

Open Access

TL;DR

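This paper proposes evaluating language models on their marginal likelihood over tokenisations rather than on a single canonical tokenisation. It argues that one-best evaluation ignores tokeniser uncertainty, which may hurt out-of-domain performance, and shows that marginal perplexity can be significantly lower than one-best perplexity.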

Contribution

It compares sampling-based estimators of the marginal likelihood over tokenisations, and evaluates pretrained English and German language models on both one-best-tokenisation and marginal perplexity.

Findings

01

Marginal perplexity can be significantly lower than one-best perplexity.

02

Tokeniser entropy correlates with the gap between one-best and marginal perplexity.

03

Marginal likelihood estimation is feasible with a manageable number of samples.

Abstract

Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt model out-of-domain performance. In this paper, we argue that instead, language models should be evaluated on their marginal likelihood over tokenisations. We compare different estimators for the marginal likelihood based on sampling, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities, and show…
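To make the quantity in the abstract concrete, here is a minimal sketch of the difference between one-best and marginal likelihood, together with a sampling-based estimate of the marginal. It is an illustration under stated assumptions, not the paper's implementation: the vocabulary `LM_LOGP`, the unigram scorer `lm_logp`, and the length-based tokeniser distribution in `proposal_logq` are all hypothetical stand-ins, and the estimator shown is a generic importance-sampling estimator, which may differ from the specific estimators the paper compares.

```python
import math
import random

# Hypothetical unigram log-probabilities standing in for a real language model.
LM_LOGP = {
    "undo": math.log(0.1),
    "un": math.log(0.2),
    "do": math.log(0.2),
    "u": math.log(0.1),
    "n": math.log(0.1),
    "d": math.log(0.1),
    "o": math.log(0.1),
}


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def segmentations(text, vocab):
    """Enumerate every way to split `text` into pieces found in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            results.extend([piece] + rest for rest in segmentations(text[end:], vocab))
    return results


def lm_logp(tokens):
    """Toy LM score: log-probability of one tokenisation (assumed unigram model)."""
    return sum(LM_LOGP[t] for t in tokens)


def proposal_logq(candidates, alpha=1.0):
    """Toy tokeniser distribution q(T | x), proportional to exp(-alpha * len(T))."""
    scores = [-alpha * len(c) for c in candidates]
    norm = logsumexp(scores)
    return [s - norm for s in scores]


def estimate_marginal(candidates, logq, n_samples, rng):
    """Importance-sampling estimate of log sum_T p(T), with samples drawn from q."""
    weights = [math.exp(lq) for lq in logq]
    drawn = rng.choices(range(len(candidates)), weights=weights, k=n_samples)
    log_ratios = [lm_logp(candidates[i]) - logq[i] for i in drawn]
    return logsumexp(log_ratios) - math.log(n_samples)


candidates = segmentations("undo", LM_LOGP)
logq = proposal_logq(candidates)

# One-best: score only the tokenisation the tokeniser ranks highest.
best_index = max(range(len(candidates)), key=lambda i: logq[i])
one_best = lm_logp(candidates[best_index])

# Exact marginal: feasible here only because the toy string has few tokenisations.
exact = logsumexp([lm_logp(c) for c in candidates])

estimate = estimate_marginal(candidates, logq, n_samples=2000, rng=random.Random(0))

print(f"one-best log-likelihood: {one_best:.3f}")
print(f"exact marginal:          {exact:.3f}")
print(f"sampled estimate:        {estimate:.3f}")
```

Because the one-best tokenisation contributes just one term to the sum over tokenisations, its log-likelihood can never exceed the exact marginal. For realistic text the sum cannot be enumerated, which is why a sampled estimate is needed in practice.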

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification