Language Model Evaluation Beyond Perplexity
Clara Meister, Ryan Cotterell

TL;DR
This paper introduces a new framework for evaluating language models based on how well their generated text matches the statistical tendencies of natural language, providing insights beyond traditional perplexity measures.
Contribution
It presents a novel evaluation method analyzing the statistical alignment of generated text with natural language trends, highlighting the influence of model architecture and generation strategy.
Findings
Neural language models appear to learn only a subset of the statistical tendencies considered.
Text generated with nucleus sampling adheres more closely to the type-token relationship of natural language.
LSTM-based models reflect natural language distributions over length and stopwords.
Abstract
We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework--paired with significance tests--for evaluating the fit of language models to these trends. We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions (when present). Further, the fit to different distributions is highly dependent on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the type--token…
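The comparison the abstract describes, checking whether generated text follows the same empirical distribution as human text and testing whether the gap is significant, can be illustrated with a small sketch. This is not the paper's implementation. It assumes sentence length in whitespace-separated tokens as the statistic, a two-sample Kolmogorov-Smirnov distance as the measure of fit, and a permutation test for significance. The function names and the toy sentences are hypothetical.

```python
import random
from bisect import bisect_right


def sentence_lengths(sentences):
    """Length of each sentence in whitespace-separated tokens (an assumed tokenization)."""
    return [len(s.split()) for s in sentences]


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    gap = 0.0
    for p in sorted(set(xs_sorted) | set(ys_sorted)):
        cdf_x = bisect_right(xs_sorted, p) / len(xs_sorted)
        cdf_y = bisect_right(ys_sorted, p) / len(ys_sorted)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


def permutation_p_value(xs, ys, n_permutations=1000, seed=0):
    """Share of random relabelings whose KS statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy stand-ins for a human corpus and model samples (hypothetical data).
human = ["the cat sat on the mat", "it rained", "we left early that day"]
generated = ["the the the", "a dog", "it was"]

distance = ks_statistic(sentence_lengths(human), sentence_lengths(generated))
p_value = permutation_p_value(sentence_lengths(human), sentence_lengths(generated))
print(distance, p_value)
```

A small distance with a large p-value would suggest the generated lengths are hard to tell apart from the human ones under this test. The same two functions could be applied to other per-text statistics, such as the fraction of stopwords.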
