How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

R. Thomas McCoy; Paul Smolensky; Tal Linzen; Jianfeng Gao; Asli Celikyilmaz

arXiv:2111.09509 · cs.CL · November 19, 2021

Open Access

TL;DR

This paper introduces RAVEN, a suite of analyses for evaluating the linguistic novelty of text generated by language models. It finds that models sometimes copy large passages from their training data, and that the text they generate is often syntactically well-formed but sometimes semantically flawed.

Contribution

The paper presents RAVEN, a framework for assessing how much language model output is copied versus newly composed, covering both sequential structure (n-grams) and syntactic structure.

Findings

1. Models copy large passages from training data in some cases.

2. Generated text is often syntactically well-formed but semantically inconsistent.

3. Larger-scale structures in generated text show high novelty, comparable to or exceeding human text.

Abstract

Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure - e.g., individual dependencies - model-generated text is substantially less novel than our baseline of human-generated text from each model's test set. For larger-scale structure - e.g., overall sentence structure - model-generated text is as novel or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from…
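The n-gram side of such an analysis can be illustrated with a minimal sketch. It is not the RAVEN implementation: it assumes whitespace tokenization, a training corpus small enough to hold in memory as a set, and a simple definition of novelty as the share of generated n-gram occurrences that never appear in the training text. The names `ngrams` and `novelty_rate` are hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous n-gram in the token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novelty_rate(generated: List[str], training: List[str], n: int) -> float:
    """Fraction of n-gram occurrences in `generated` absent from `training`.

    Assumption: an n-gram counts as novel if it never occurs anywhere
    in the training tokens. This is an illustrative definition only.
    """
    seen: Set[Tuple[str, ...]] = set(ngrams(training, n))
    candidates = ngrams(generated, n)
    if not candidates:
        raise ValueError("generated text is shorter than n")
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)


# Toy data, invented for illustration (not from the paper).
training = "the cat sat on the mat".split()
generated = "the cat sat on the rug".split()

# Novelty typically rises with n: longer n-grams are less likely to be copied.
rates = {n: novelty_rate(generated, training, n) for n in (1, 2, 3)}
for n, rate in rates.items():
    print(f"{n}-gram novelty: {rate:.3f}")
```

Comparing these rates for model-generated text against the same rates for held-out human text is what turns the raw numbers into a baseline comparison of the kind the abstract describes.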

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Tanh Activation · Cosine Annealing · Adaptive Input Representations · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing