Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG

William Merrill; Noah A. Smith; Yanai Elazar

arXiv:2406.13069·cs.CL·August 26, 2025

TL;DR

This paper introduces Rusty-DAWG, a tool for efficiently measuring $n$-gram novelty in language model outputs, and uses it to show how model size and decoding strategy affect the originality of generated text relative to human writing.

Contribution

The paper presents Rusty-DAWG, a novel indexing tool for arbitrary-length $n$-gram search, and provides empirical insights into factors affecting language model text novelty.

Findings

01 LM-generated text is less novel than human-written text for $n > 4$

02 Larger models and more constrained decoding reduce novelty

03 LMs complete $n$-grams with lower loss when those $n$-grams are frequent in the training data

Abstract

How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained…
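
The definition of $n$-novelty in the abstract can be illustrated with a minimal Python sketch. This is a naive baseline and not Rusty-DAWG's API: it assumes whitespace-tokenized toy data and builds a hash set of corpus $n$-grams for one fixed $n$, so its memory use grows with the corpus and it has to be rebuilt for every $n$. The function names `ngrams` and `n_novelty` are hypothetical, chosen only for this illustration.

```python
from typing import Sequence


def ngrams(tokens: Sequence[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams of `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def n_novelty(generated: Sequence[str], corpus: Sequence[str], n: int) -> float:
    """Proportion of n-grams in `generated` that never occur in `corpus`.

    Naive set-based version for illustration only (assumption: both
    inputs are already tokenized and small enough to fit in memory).
    """
    seen = set(ngrams(corpus, n))      # every n-gram present in the corpus
    gen = ngrams(generated, n)         # n-grams of the generated text
    if not gen:
        raise ValueError("generated text is shorter than n tokens")
    novel = sum(1 for g in gen if g not in seen)
    return novel / len(gen)


# Toy data, invented for this example.
corpus = "the cat sat on the mat".split()
generated = "the cat sat on a mat".split()

for n in (1, 2, 3):
    print(n, n_novelty(generated, corpus, n))
```

On this toy input, novelty rises with $n$, because a longer $n$-gram is less likely to appear verbatim in the corpus. An index that supports arbitrary-length queries, as the abstract describes, avoids rebuilding a separate set for each $n$.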

Code & Models

Repositories

viking-sudo-rm/rusty-dawg (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Language and cultural evolution

Methods: Pythia