Assessing Language Models with Scaling Properties
Shuntaro Takahashi, Kumiko Tanaka-Ishii

TL;DR
This paper introduces evaluation methods for language models based on how well they reproduce the scaling properties of natural language, revealing that current models capture its long memory only to a limited degree.
Contribution
It proposes five novel tests based on scaling properties to evaluate language models beyond perplexity, highlighting the limited long memory capabilities of neural models.
Findings
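Neural language models exhibit the long memory properties of natural language, but only to a limited degree
The other models tested (n-grams, PCFG, Simon and Pitman-Yor processes, hierarchical PY) do not exhibit long memory
The scaling-property tests give qualitative information on where models succeed or fail, which perplexity alone does not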
Abstract
Language models have primarily been evaluated with perplexity. While perplexity quantifies the most comprehensible prediction performance, it does not provide qualitative information on the success or failure of models. Another approach for evaluating language models is thus proposed, using the scaling properties of natural language. Five such tests are considered, with the first two accounting for the vocabulary population and the other three for the long memory of natural language. The following models were evaluated with these tests: n-grams, probabilistic context-free grammar (PCFG), Simon and Pitman-Yor (PY) processes, hierarchical PY, and neural language models. Only the neural language models exhibit the long memory properties of natural language, but to a limited degree. The effectiveness of every test of these models is also discussed.
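As an illustration of what a vocabulary-population test might look like, here is a minimal Python sketch. The abstract does not name the five tests, so the choice of a rank-frequency (Zipf-style) fit and a vocabulary-growth (Heaps-style) curve is an assumption, and the function names `rank_frequency`, `loglog_slope`, and `vocabulary_growth` are hypothetical helpers, not the authors' code. The idea is to compute the same statistics on natural text and on text sampled from a model, then compare the fitted exponents.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def vocabulary_growth(tokens):
    """Return (text length, vocabulary size) pairs for each prefix of the text."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth


def loglog_slope(points):
    """Least-squares slope of log(y) against log(x); assumes all x, y > 0.

    For a power law y = c * x**a, this slope estimates the exponent a.
    """
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic corpus (an assumption for the demo): word r occurs about 1000/r times.
    tokens = []
    for r in range(1, 51):
        tokens.extend([f"w{r}"] * round(1000 / r))
    print("rank-frequency exponent:", loglog_slope(rank_frequency(tokens)))
```

A plain least-squares fit in log-log space is used here only for brevity; it is a rough estimator, and the three long memory tests mentioned in the abstract are not covered by this sketch.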
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Neural Networks and Applications
