Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

Ming Liu

arXiv:2605.20602 · cs.CL · May 21, 2026

TL;DR

Self-training restructures language by amplifying surface markers and collapsing deep syntactic structures, challenging the notion of uniform flattening in language models.

Contribution

This study formalizes the Structural Depth Hypothesis, showing how self-training differentially affects linguistic features based on their structural depth.

Findings

01

Surface markers increase while deep syntactic features collapse during self-training.

02

Structural depth predicts feature decay better than frequency, with a significant correlation (rho=0.540).

03

Self-training-specific effects are confirmed by a control with human-text fine-tuning.
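
The correlation in finding 02 can be illustrated with a small rank-correlation sketch. This assumes the reported rho is a rank correlation such as Spearman's, which this summary does not state. The depth scores and decay rates below are made-up placeholders, not values from the paper, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical inputs (NOT from the paper): one entry per linguistic feature.
depth = [0, 0, 1, 2, 2, 3]                      # assumed structural depth scores
decay = [-0.05, 0.01, 0.10, 0.08, 0.30, 0.25]   # assumed per-generation decay rates

print(f"rho = {spearman_rho(depth, decay):.3f}")
```

Comparing this value with the same statistic computed against generation-zero frequency is one way to check the "depth beats frequency" claim, under the assumptions above.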

Abstract

Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output…
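
As a rough illustration of what a per-generation decay rate could look like, the sketch below measures how often a feature occurs per 1,000 words and fits an exponential trend across generations. The abstract as shown here does not give the exact estimator, so the log-linear fit, the regex proxy for questions, and the names `rate_per_1k_words` and `decay_rate` are all assumptions for illustration.

```python
import math
import re


def rate_per_1k_words(text, pattern):
    """Occurrences of a regex pattern per 1,000 whitespace-separated words."""
    words = len(text.split())
    if words == 0:
        return 0.0
    return 1000 * len(re.findall(pattern, text)) / words


def decay_rate(rates):
    """Fit log(rate) = a - k * generation by least squares and return k.

    Positive k means the feature decays; negative k means it is amplified.
    Generations with a zero rate are skipped because log(0) is undefined.
    """
    pts = [(g, math.log(r)) for g, r in enumerate(rates) if r > 0]
    if len(pts) < 2:
        raise ValueError("need at least two generations with a nonzero rate")
    n = len(pts)
    mean_g = sum(g for g, _ in pts) / n
    mean_l = sum(l for _, l in pts) / n
    sxx = sum((g - mean_g) ** 2 for g, _ in pts)
    sxy = sum((g - mean_g) * (l - mean_l) for g, l in pts)
    return -sxy / sxx


# Crude proxy for questions: count question marks (an assumption, not the paper's parser).
sample = "Is it flat? No. It is restructured. Why?"
print(rate_per_1k_words(sample, r"\?"))

# Synthetic rates over eleven generations (illustrative, not measured data).
collapsing = [12.0 * math.exp(-0.3 * g) for g in range(11)]
rising = [5.0 * math.exp(0.1 * g) for g in range(11)]
print(decay_rate(collapsing), decay_rate(rising))
```

Under this convention, a surface marker that rises across generations gets a negative decay rate and a collapsing deep structure gets a positive one, so both kinds of feature sit on a single scale.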
