Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text
Kai Kugler

TL;DR
This paper investigates how compliance with Martin's Law, the relationship between word frequency and polysemy, develops in text generated by neural language models during training. It finds a non-monotonic pattern of emergence and degradation across model sizes and training checkpoints.
Contribution
It introduces a systematic methodology to evaluate the emergence of linguistic regularities in neural language models using clustering of contextual embeddings.
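One way to read the clustering step is sketched below: the contextual embeddings of a single word's occurrences are clustered with DBSCAN, and the number of non-noise clusters is taken as that word's sense count. This is a minimal sketch and not the paper's pipeline. The helper names `count_senses` and `synthetic_occurrences`, the `eps` and `min_samples` values, the default Euclidean metric, and the synthetic 8-dimensional vectors standing in for model embeddings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def count_senses(embeddings, eps=0.5, min_samples=3):
    """Return the number of DBSCAN clusters, ignoring noise points (label -1).

    eps and min_samples are illustrative defaults, not values from the paper.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    return len(set(labels.tolist()) - {-1})


def synthetic_occurrences(centers, n_per_center=20, noise=0.05, seed=0):
    """Hypothetical stand-in for contextual embeddings: tight blobs around centers."""
    rng = np.random.default_rng(seed)
    blobs = [c + noise * rng.standard_normal((n_per_center, len(c))) for c in centers]
    return np.vstack(blobs)


dim = 8
sense_a = np.eye(dim)[0]  # made-up "sense" direction
sense_b = np.eye(dim)[1]  # a second, well-separated direction

one_sense_word = synthetic_occurrences([sense_a])
two_sense_word = synthetic_occurrences([sense_a, sense_b])

print(count_senses(one_sense_word))
print(count_senses(two_sense_word))
```

With real embeddings the result depends heavily on `eps` and on the distance metric, so these would need tuning before sense counts could be compared across checkpoints.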
Findings
Martin's Law emerges around checkpoint 100.
Peak correlation (r > 0.6) occurs at checkpoint 10^4.
Larger models degrade more gracefully than smaller ones.
Abstract
We present the first systematic investigation of Martin's Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 10^4, then degrades by checkpoint 10^5. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r ≈ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not…
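The correlation values quoted in the abstract can be read as a correlation between per-word frequency and per-word sense count. The sketch below shows one plausible form of that test: a Pearson correlation between log frequency and number of senses. The log transform, the choice of Pearson's r, the function names, and the toy numbers are all assumptions for illustration. None of the values come from the paper.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def martins_law_r(frequencies, sense_counts):
    """Correlate log word frequency with sense count (assumed operationalization)."""
    log_freqs = [math.log(f) for f in frequencies]
    return pearson_r(log_freqs, sense_counts)


# Made-up toy data: five hypothetical words, from most to least frequent.
toy_frequencies = [1000, 400, 120, 35, 8]
toy_sense_counts = [5, 4, 2, 2, 1]

r = martins_law_r(toy_frequencies, toy_sense_counts)
print(f"r = {r:.3f}")
```

A positive r on a given checkpoint's generated text would indicate that more frequent words tend to have more senses there. Tracking r across checkpoints is what would expose a non-monotonic trajectory.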
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Language and cultural evolution · Language Development and Disorders
