You should evaluate your language model on marginal likelihood over tokenisations
Kris Cao, Laura Rimell

TL;DR
This paper proposes evaluating language models on their marginal likelihood over tokenisations rather than on a single canonical tokenisation. It argues that one-best evaluation ignores tokeniser uncertainty, and shows that marginal perplexity can be significantly lower than one-best perplexity.
Contribution
It compares sampling-based estimators of the marginal likelihood over tokenisations, shows that a manageable number of samples suffices, and evaluates pretrained English and German language models on both one-best and marginal perplexity.
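For orientation, the quantities involved can be written out as follows. This is a sketch in our own notation, not taken from the paper: x is the text, T ranges over token sequences that decode to x, p_θ is the language model, and q(T | x) is an assumed proposal distribution over tokenisations (for example, the tokeniser's own sampling distribution). The importance-sampling form shown is one standard sampling-based estimator and is not necessarily the exact one the authors use.

```latex
% Marginal likelihood of text x over all tokenisations T that decode to x
p(x) \;=\; \sum_{T \,:\, \mathrm{detok}(T) = x} p_\theta(T)

% One-best evaluation scores a single canonical tokenisation T^*,
% which is one term of the sum and therefore a lower bound
p_\theta(T^*) \;\le\; p(x)

% Importance-sampling estimate from N tokenisations drawn from q
\hat{p}(x) \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{p_\theta(T_i)}{q(T_i \mid x)},
\qquad T_i \sim q(\cdot \mid x)

% Tokeniser entropy, a measure of tokeniser uncertainty for x
H(q) \;=\; -\sum_{T} q(T \mid x) \,\log q(T \mid x)
```

Because the one-best term is a single summand of p(x), marginal perplexity can never exceed one-best perplexity under the same model. The open question is how large the gap is.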
Findings
Marginal perplexity can be significantly lower than one-best perplexity.
Tokeniser entropy correlates with the gap between marginal and one-best perplexity.
Marginal likelihood estimation is feasible with a manageable number of samples.
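The findings above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy vocabulary `VOCAB_LOGP`, the unigram scorer `joint_logp` standing in for a real language model, and the uniform proposal over segmentations. A real setup would sample tokenisations from the tokeniser and score them with the pretrained model.

```python
import math
import random

# Hypothetical toy vocabulary with unigram log-probabilities (illustration only).
VOCAB_LOGP = {
    "a": math.log(0.2),
    "b": math.log(0.2),
    "ab": math.log(0.3),
    "ba": math.log(0.1),
    "aba": math.log(0.2),
}


def tokenisations(text):
    """Enumerate every segmentation of `text` into vocabulary tokens."""
    if not text:
        return [[]]
    out = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in VOCAB_LOGP:
            out.extend([piece] + rest for rest in tokenisations(text[end:]))
    return out


def joint_logp(tokens):
    """Toy stand-in for a language model: independent unigram log-score."""
    return sum(VOCAB_LOGP[t] for t in tokens)


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def exact_marginal_logp(text):
    """Sum the model score over all tokenisations (feasible only for toys)."""
    return logsumexp([joint_logp(t) for t in tokenisations(text)])


def one_best_logp(text):
    """Score of the single highest-scoring tokenisation."""
    return max(joint_logp(t) for t in tokenisations(text))


def importance_sampled_logp(text, n_samples, rng):
    """Estimate the marginal log-score from sampled tokenisations.

    Assumption: the proposal q(T | text) is uniform over segmentations,
    chosen for simplicity rather than taken from the paper.
    """
    candidates = tokenisations(text)
    log_q = -math.log(len(candidates))
    log_weights = [
        joint_logp(rng.choice(candidates)) - log_q for _ in range(n_samples)
    ]
    return logsumexp(log_weights) - math.log(n_samples)
```

For the string "aba" the four segmentations score 0.008, 0.02, 0.06 and 0.2, so the exact marginal is 0.288 while the one-best score is 0.2. Calling `importance_sampled_logp("aba", 5000, random.Random(0))` gives an estimate close to `math.log(0.288)`, which shows in miniature how a modest number of samples can recover the marginal.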
Abstract
Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt model out-of-domain performance. In this paper, we argue that instead, language models should be evaluated on their marginal likelihood over tokenisations. We compare different estimators for the marginal likelihood based on sampling, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities, and show…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
