Domain Regeneration: How well do LLMs match syntactic properties of text domains?
Da Ju, Hagen Blix, Adina Williams

TL;DR
This study assesses how well large language models replicate the syntactic properties of text domains like Wikipedia and news, revealing they tend to produce texts with less variability and simpler structures than human-authored content.
Contribution
It introduces a corpus linguistics-inspired method to evaluate LLMs' fidelity in reproducing syntactic properties of specific text domains.
Findings
LLMs produce texts with shifted mean syntactic properties.
Generated texts show reduced variability and fewer complex structures.
Most distributions of syntactic properties are less diverse in the generated texts than in the original human texts.
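A minimal sketch of what "shifted mean" and "reduced variability" look like for sentence length, one of the simpler properties the study examines. This is not the authors' pipeline: the regex-based sentence splitter, the helper names `sentence_lengths` and `summarize`, and the two toy texts are all assumptions made up for illustration, and the numbers they produce are not results from the paper.

```python
import re
import statistics


def sentence_lengths(text):
    """Return the word count of each sentence (naive split on ., !, ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def summarize(lengths):
    """Summarize a length distribution by its mean, spread, and longest item."""
    return {
        "mean": statistics.mean(lengths),
        "stdev": statistics.pstdev(lengths),  # population standard deviation
        "max": max(lengths),
    }


# Toy stand-ins (invented here): a "human" text with varied sentence lengths
# and a "regenerated" text with uniform ones.
human = (
    "Short one. This sentence, by contrast, runs on for quite a while "
    "before it finally stops. Medium length sentence here."
)
generated = (
    "This is a sentence. This is another sentence. Here is one more sentence."
)

human_stats = summarize(sentence_lengths(human))
generated_stats = summarize(sentence_lengths(generated))

print("human:    ", human_stats)
print("generated:", generated_stats)
```

In a real comparison the same summary would be computed over full corpora with a proper sentence segmenter and parser; the point of the sketch is only that the mean, the spread, and the long tail are separate quantities that can each differ between the two sources.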
Abstract
Recent improvements in large language model performance have, in all likelihood, been accompanied by improvements in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt a commonly used, open-source LLM to regenerate text from two domains of permissively licensed English text which are often contained in LLM training data -- Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a fairly semantically controlled setting. We investigate varying levels of syntactic abstraction, from simpler properties like sentence length and article readability, to more complex…
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling
