Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

Kai Kugler

arXiv:2511.21334·cs.CL·November 27, 2025

TL;DR

This paper tracks how compliance with Martin's Law, the empirical relationship between word frequency and polysemy, develops in text generated by neural language models during training. The regularity emerges and then degrades non-monotonically, and the pattern differs across model sizes and training checkpoints.

Contribution

It introduces a systematic methodology for evaluating the emergence of linguistic regularities in neural language models, using DBSCAN clustering of contextualized embeddings to operationalize word senses.

Findings

01 Martin's Law emerges around checkpoint 100.

02 Peak correlation (r > 0.6) occurs at checkpoint 104.

03 Larger models (410M, 1B) degrade gracefully, while smaller models (70M, 160M) collapse at late checkpoints.

Abstract

We present the first systematic investigation of Martin's Law, the empirical relationship between word frequency and polysemy, in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 104, then degrades by checkpoint 105. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable ($r \approx -0.3$) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not…
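The measurement described in the abstract can be illustrated with a minimal sketch. It is not the paper's code, and it makes several assumptions: toy 2-D vectors stand in for contextualized embeddings from a Pythia checkpoint, a small hand-written DBSCAN stands in for a library implementation, and the `eps` and `min_samples` values, the word list, and the frequencies are all invented for illustration. The number of non-noise clusters per word is taken as its sense count, and Martin's Law is scored as the Pearson correlation between log frequency and sense count.

```python
import math

def dbscan_labels(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    n = len(points)
    # Each point's neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

def sense_count(embeddings, eps=0.5, min_samples=2):
    """Number of non-noise clusters, used here as a proxy for word senses."""
    labels = dbscan_labels(embeddings, eps, min_samples)
    return len({label for label in labels if label != -1})

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: word -> (corpus frequency, toy "contextual embeddings").
toy = {
    "run":   (1000, [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (9, 0), (9, 0.1)]),
    "bank":  (100,  [(0, 0), (0.1, 0.1), (4, 4), (4.1, 4)]),
    "quark": (10,   [(1, 1), (1.1, 1), (1, 1.1)]),
}

log_freqs = [math.log10(freq) for freq, _ in toy.values()]
senses = [sense_count(vectors) for _, vectors in toy.values()]
r = pearson(log_freqs, senses)
print(senses, round(r, 3))
```

In this toy setup the more frequent words were deliberately given more clusters, so the correlation is strongly positive. Repeating the same computation on text generated at each training checkpoint would give a trajectory of r values of the kind the abstract describes.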

Taxonomy

Topics: Neurobiology of Language and Bilingualism · Language and cultural evolution · Language Development and Disorders