Sentence Length
Gábor Borbély, András Kornai

TL;DR
This paper introduces a random walk model of sentence length distribution that fits the data better than previous models and offers a better understanding of the distribution, along with a hyperparameter-free Bayesian model comparison method.
Contribution
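It presents a novel random walk model for sentence length, a generalization of KL divergence, a discussion of how to measure the noise inherent in a corpus, and a hyperparameter-free Bayesian model comparison method with strong conceptual ties to Minimum Description Length (MDL) modeling.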
Findings
The new model fits sentence length data better than existing models.
The Bayesian comparison method is hyperparameter-free and conceptually linked to MDL.
The resulting models require only a few dozen bits, orders of magnitude fewer than naive nonparametric MDL models would.
Abstract
The distribution of sentence length in ordinary language is not well captured by the existing models. Here we survey previous models of sentence length and present our random walk model that offers both a better fit with the data and a better understanding of the distribution. We develop a generalization of KL divergence, discuss measuring the noise inherent in a corpus, and present a hyperparameter-free Bayesian model comparison method that has strong conceptual ties to Minimal Description Length modeling. The models we obtain require only a few dozen bits, orders of magnitude less than the naive nonparametric MDL models would.
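As a rough illustration of what "fit" means here, the sketch below scores a candidate model against an empirical sentence length distribution using the standard KL divergence, measured in bits. It is not the paper's method: the generalized KL divergence and the random walk model are not specified in this summary, so the code uses ordinary KL divergence and a geometric distribution as a stand-in baseline. The function names and the toy length data are hypothetical.

```python
import math
from collections import Counter


def empirical_distribution(lengths):
    """Relative frequency of each observed sentence length."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}


def kl_divergence_bits(p, q):
    """Standard KL divergence D(p || q) in bits.

    Returns infinity if q assigns zero probability to a length
    that p observes (this is plain KL, not the paper's generalization).
    """
    total = 0.0
    for n, p_n in p.items():
        if p_n == 0.0:
            continue
        q_n = q.get(n, 0.0)
        if q_n == 0.0:
            return math.inf
        total += p_n * math.log2(p_n / q_n)
    return total


def geometric_model(support, mean_length):
    """Stand-in baseline: geometric pmf on lengths 1, 2, 3, ...

    Assumption for illustration only; the paper's random walk model
    is a different, better-fitting family.
    """
    success = 1.0 / mean_length
    return {n: (1.0 - success) ** (n - 1) * success for n in support}


# Hypothetical toy data: sentence lengths in words.
toy_lengths = [5, 8, 8, 12, 12, 12, 15, 20, 20, 31]

observed = empirical_distribution(toy_lengths)
mean_length = sum(toy_lengths) / len(toy_lengths)
baseline = geometric_model(observed.keys(), mean_length)

divergence = kl_divergence_bits(observed, baseline)
print(f"KL(observed || geometric baseline) = {divergence:.3f} bits")
```

A lower divergence means the model wastes fewer bits per sentence when encoding the observed lengths, which is the sense in which a fit measure of this kind connects to description length.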
Taxonomy
Topics: Linguistics and Discourse Analysis
Methods: Minimum Description Length
