How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN
R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, Asli Celikyilmaz

TL;DR
This paper introduces RAVEN, a suite of analyses for evaluating the linguistic novelty of text generated by language models. It finds that models sometimes copy long passages from their training data, and that generated text is usually syntactically well-formed but sometimes semantically flawed.
Contribution
The paper presents RAVEN, a novel framework for assessing the extent of copying versus abstraction in language model outputs across different structural levels.
Findings
Models copy large passages from training data in some cases.
Generated text is often syntactically well-formed but semantically inconsistent.
Larger-scale structures in generated text show high novelty, comparable to or exceeding human text.
Abstract
Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure - e.g., individual dependencies - model-generated text is substantially less novel than our baseline of human-generated text from each model's test set. For larger-scale structure - e.g., overall sentence structure - model-generated text is as novel or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from…
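The n-gram side of the analysis can be illustrated with a minimal sketch. This is not the RAVEN implementation: the function names `ngrams` and `ngram_novelty`, the whitespace tokenization, and the toy sentences are all assumptions made for illustration. The idea shown is simply to measure what fraction of the n-grams in a generated text never appear in the training text.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous n-gram in `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_novelty(generated: List[str], training: List[str], n: int) -> float:
    """Fraction of n-grams in `generated` that never occur in `training`.

    Hypothetical helper for illustration; returns 0.0 when `generated`
    is too short to contain any n-gram of the requested size.
    """
    seen: Set[Tuple[str, ...]] = set(ngrams(training, n))
    candidates = ngrams(generated, n)
    if not candidates:
        return 0.0
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)


# Toy data (assumed, not from the paper), tokenized on whitespace.
training = "the cat sat on the mat".split()
generated = "the cat sat on the rug".split()

for n in (1, 2, 6):
    print(n, ngram_novelty(generated, training, n))
```

In this toy case novelty rises with n: short n-grams are mostly reused, while the full six-word sequence is new. A real analysis would need a proper tokenizer and an index over the full training corpus rather than an in-memory set built from one sentence.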
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Tanh Activation · Cosine Annealing · Adaptive Input Representations · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing
