Detection and Measurement of Syntactic Templates in Generated Text

Chantal Shaib; Yanai Elazar; Junyi Jessy Li; Byron C. Wallace

arXiv:2407.00211·cs.CL·October 8, 2024

TL;DR

This paper analyzes syntactic templates in text generated by language models, showing that they are prevalent, largely traceable to pre-training data, and useful for evaluating model behavior and memorization.

Contribution

The paper introduces a method for identifying syntactic templates in generated text, links those templates to pre-training data, and demonstrates their use in model evaluation.

Findings

01. 76% of templates in model-generated text can also be found in pre-training data, compared with 35% for human-authored text

02. Templates as features differentiate between models, tasks, and domains

03. Templates help analyze style memorization in LLMs

Abstract

Recent work on evaluating the diversity of text generated by LLMs has focused on word-level features. Here we offer an analysis of syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning processes such as RLHF. This connection to the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model…
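To make the idea of a syntactic template concrete, here is a minimal sketch in Python. It rests on assumptions that the abstract does not state: that a template can be approximated as a part-of-speech (POS) n-gram recurring in at least `min_docs` texts, and that the texts are already POS-tagged. The function names, the thresholds, and the toy tag sequences are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from collections import Counter

def pos_ngrams(tags, n):
    """Return all contiguous POS n-grams in one tagged text."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def find_templates(tagged_texts, n=3, min_docs=2):
    """Assumed definition: a template is a POS n-gram that occurs
    in at least `min_docs` different texts."""
    doc_counts = Counter()
    for tags in tagged_texts:
        # set() so each text contributes at most once per n-gram
        doc_counts.update(set(pos_ngrams(tags, n)))
    return {gram for gram, count in doc_counts.items() if count >= min_docs}

def template_rate(tagged_texts, templates, n=3):
    """Fraction of texts that contain at least one template."""
    if not tagged_texts:
        return 0.0
    hits = sum(
        any(gram in templates for gram in pos_ngrams(tags, n))
        for tags in tagged_texts
    )
    return hits / len(tagged_texts)

# Hypothetical, hand-tagged POS sequences (Penn Treebank style tags).
tagged_texts = [
    ["DT", "JJ", "NN", "VBZ", "DT", "NN"],
    ["PRP", "VBZ", "DT", "JJ", "NN", "VBZ", "RB"],
    ["NNS", "VBP", "RB"],
]

templates = find_templates(tagged_texts, n=3, min_docs=2)
rate = template_rate(tagged_texts, templates, n=3)
print(sorted(templates))
print(round(rate, 3))
```

In this toy input, the first two texts share the tag sequences `DT JJ NN` and `JJ NN VBZ`, so both count as templates under the assumed definition, and two of the three texts contain at least one template. In practice the tags would come from a POS tagger, and the same n-grams could be looked up in a pre-training corpus to estimate how many templates originate there.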


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification