Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
William Merrill, Noah A. Smith, Yanai Elazar

TL;DR
This paper introduces Rusty-DAWG, a tool for efficiently measuring the $n$-gram novelty of language model outputs, and uses it to show how model size and decoding strategy affect the novelty of generated text relative to human writing.
Contribution
The paper presents Rusty-DAWG, a novel indexing tool for arbitrary-length $n$-gram search, and provides empirical insights into factors affecting language model text novelty.
Findings
LM-generated text is less novel than human text for $n > 4$
Larger models and constrained decoding reduce novelty
LMs complete $n$-grams with lower loss when those $n$-grams are more frequent in the training data
Abstract
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained…
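To make the definition of $n$-novelty in the abstract concrete, here is a minimal Python sketch. It is not Rusty-DAWG and does not use its API: it swaps in a plain hash set of corpus $n$-grams for a single fixed $n$, which is only practical for toy corpora. The function names (`ngrams`, `build_index`, `n_novelty`) are hypothetical, and counting every generated $n$-gram occurrence (rather than distinct types) is an assumption about how the proportion is taken.

```python
from typing import Iterable, List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> List[NGram]:
    """Return every contiguous n-gram in `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(corpus: Iterable[Sequence[str]], n: int) -> Set[NGram]:
    """Collect all n-grams of one fixed length from the corpus.

    Assumption: a hash set stands in for the real index. Unlike the
    tool described above, it must be rebuilt for each value of n.
    """
    index: Set[NGram] = set()
    for doc in corpus:
        index.update(ngrams(doc, n))
    return index


def n_novelty(generated: Sequence[str],
              corpus: Iterable[Sequence[str]],
              n: int) -> float:
    """Fraction of generated n-grams that never occur in the corpus."""
    grams = ngrams(generated, n)
    if not grams:
        raise ValueError("generated text is shorter than n tokens")
    index = build_index(corpus, n)
    novel = sum(1 for g in grams if g not in index)
    return novel / len(grams)


if __name__ == "__main__":
    corpus = [["the", "cat", "sat", "on", "the", "mat"]]
    generated = ["the", "cat", "sat", "on", "a", "mat"]
    for n in (1, 2, 3):
        print(n, n_novelty(generated, corpus, n))
```

In this toy example one substituted token makes one of six unigrams novel, but two of five bigrams and two of four trigrams, which illustrates why novelty tends to rise with $n$.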
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Language and cultural evolution
Methods: Pythia
