Sentence Length

Gábor Borbély; András Kornai

arXiv:1905.09139 · cs.CL · May 23, 2019 · 1 citation

TL;DR

This paper introduces a random walk model of sentence length distribution that fits the data better than previous models and gives a clearer account of the distribution. It also presents a hyperparameter-free Bayesian model comparison method closely related to Minimum Description Length (MDL) modeling.

Contribution

It presents a novel random walk model for sentence length, a generalization of KL divergence, a discussion of how to measure the noise inherent in a corpus, and a hyperparameter-free Bayesian model comparison approach with strong conceptual ties to MDL.

Findings

1. The random walk model fits sentence length data better than existing models.

2. The Bayesian model comparison method is hyperparameter-free and conceptually linked to MDL.

3. The resulting models require only a few dozen bits, orders of magnitude fewer than naive nonparametric MDL models.

Abstract

The distribution of sentence length in ordinary language is not well captured by the existing models. Here we survey previous models of sentence length and present our random walk model that offers both a better fit with the data and a better understanding of the distribution. We develop a generalization of KL divergence, discuss measuring the noise inherent in a corpus, and present a hyperparameter-free Bayesian model comparison method that has strong conceptual ties to Minimal Description Length modeling. The models we obtain require only a few dozen bits, orders of magnitude less than the naive nonparametric MDL models would.
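As a rough illustration of what comparing a sentence length model against data can look like, the sketch below computes the standard KL divergence, in bits, between an empirical length distribution and a candidate model. It is not the paper's generalized KL divergence or its random walk model: the function names, the whitespace tokenization, and the uniform candidate model are all assumptions made for this example.

```python
import math
from collections import Counter


def length_distribution(sentences):
    """Empirical distribution of sentence lengths.

    Assumption: length is counted in whitespace-separated tokens.
    """
    counts = Counter(len(s.split()) for s in sentences)
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}


def kl_divergence_bits(p, q):
    """Standard KL divergence D(p || q) in bits.

    q must assign positive probability to every length that occurs in p.
    """
    return sum(pn * math.log2(pn / q[n]) for n, pn in p.items() if pn > 0)


# Toy corpus (hypothetical data) with lengths 2, 3, 2, 1.
sentences = ["a b", "a b c", "a b", "a"]
p = length_distribution(sentences)

# Hypothetical candidate model: uniform over lengths 1 to 3.
q = {n: 1 / 3 for n in (1, 2, 3)}

divergence = kl_divergence_bits(p, q)
print(f"D(p || q) = {divergence:.4f} bits")
```

A lower divergence means the candidate model wastes fewer bits per sentence when encoding the observed lengths, which is the intuition behind the description length comparison mentioned in the abstract.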

Code & Models

Repositories

hlt-bme-hu/SentenceLength (official)

Taxonomy

Topics: Linguistics and Discourse Analysis

Methods: Minimum Description Length