Long-Range Correlation Underlying Childhood Language and Generative Models
Kumiko Tanaka-Ishii

TL;DR
This paper shows that long-range correlation occurs in long CHILDES childhood language data sets and examines which Bayesian generative models of language reproduce it: the Simon model does, whereas the Pitman-Yor model does not.
Contribution
It demonstrates long-range correlation in childhood language data and introduces a new model, a combination of the Simon and Pitman-Yor models, that reproduces this correlation with a correct vocabulary growth rate.
Findings
Long-range correlation exists in childhood language data.
The Simon model exhibits strong long-range correlation; the Pitman-Yor model does not.
A new model combining the Simon and Pitman-Yor models maintains long-range correlation with a correct vocabulary growth rate.
Abstract
Long-range correlation, a property of time series exhibiting long-term memory, is mainly studied in the statistical physics domain and has been reported to exist in natural language. Using a state-of-the-art method for such analysis, long-range correlation is first shown to occur in long CHILDES data sets. To understand why, Bayesian generative models of language, originally proposed in the cognitive scientific domain, are investigated. Among representative models, the Simon model was found to exhibit surprisingly good long-range correlation, but not the Pitman-Yor model. Since the Simon model is known not to correctly reflect the vocabulary growth of natural language, a simple new model is devised as a conjunct of the Simon and Pitman-Yor models, such that long-range correlation holds with a correct vocabulary growth rate. The investigation overall suggests that uniform sampling is one…
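To make the Simon model named in the abstract concrete, here is a minimal Python sketch. It assumes the standard textbook formulation: at each step a new word is introduced with a fixed probability, and otherwise an element is drawn uniformly from the sequence generated so far and copied. The function names, the parameter `new_word_prob`, and its default value are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def simon_sequence(length, new_word_prob=0.1, seed=0):
    """Generate integer word ids with a Simon-style process (illustrative sketch).

    Assumes length >= 1. The fixed new_word_prob is an assumption of this
    sketch, not a value from the paper.
    """
    rng = random.Random(seed)
    sequence = [0]  # start from a single word type
    next_id = 1
    while len(sequence) < length:
        if rng.random() < new_word_prob:
            # Introduce a brand-new word type.
            sequence.append(next_id)
            next_id += 1
        else:
            # Copy an element sampled uniformly from the past sequence,
            # so frequent words are proportionally more likely to recur.
            sequence.append(rng.choice(sequence))
    return sequence


def vocabulary_growth(sequence):
    """Return the number of distinct word types seen after each position."""
    seen = set()
    growth = []
    for word in sequence:
        seen.add(word)
        growth.append(len(seen))
    return growth


if __name__ == "__main__":
    seq = simon_sequence(10_000)
    growth = vocabulary_growth(seq)
    print("tokens:", len(seq), "types:", growth[-1])
```

Because new words arrive at a constant rate in this sketch, the vocabulary grows roughly linearly with sequence length, which is the mismatch with natural language that the abstract mentions as the motivation for combining the Simon model with the Pitman-Yor model.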
