The predictability of letters in written english

Thomas Schürmann, Peter Grassberger

arXiv:0710.4516·physics.soc-ph·April 24, 2017

TL;DR

This paper investigates how the predictability of letters in written English varies with their position within words, revealing that internal letters are significantly more predictable than initial letters, reflecting the subunit structure of words.

Contribution

The paper demonstrates that the predictability of a letter depends strongly on its position within the word, and it quantifies the entropy difference between initial and internal letters.

Findings

01

First letters are least predictable.

02

The average entropy of a letter deep inside a word is roughly four times smaller than the entropy of the first letter.

03

Words act as well-defined subunits with weaker cross-unit correlations.

Abstract

We show that the predictability of letters in written English texts depends strongly on their position in the word. The first letters are usually the least easy to predict. This agrees with the intuitive notion that words are well defined subunits in written languages, with much weaker correlations across these units than within them. It implies that the average entropy of a letter deep inside a word is roughly 4 times smaller than the entropy of the first letter.
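To make the position dependence concrete, the following is a minimal sketch of a plug-in estimate of the conditional entropy of a letter at position k given the preceding letters of the same word. It is not the authors' estimation method, which this summary does not describe. The function name `positional_entropy` and the parameter `max_pos` are illustrative assumptions. The estimate ignores all context from earlier words, and on small samples it is biased downward at deeper positions, so it should not be expected to reproduce the factor of roughly 4 reported in the abstract.

```python
import math
import re
from collections import Counter, defaultdict


def positional_entropy(text, max_pos=6):
    """Estimate H(letter at position k | earlier letters of the same word), in bits.

    Illustrative sketch only: a plug-in (maximum-likelihood) estimate that
    uses the in-word prefix as the sole context. Returns a dict mapping the
    1-based position k to the estimated entropy.
    """
    # Crude tokenization (an assumption): lowercase runs of a-z count as words.
    words = re.findall(r"[a-z]+", text.lower())

    # counts[k][prefix][letter] = how often `letter` follows `prefix` at position k
    counts = defaultdict(lambda: defaultdict(Counter))
    for w in words:
        for k, ch in enumerate(w[:max_pos]):
            counts[k][w[:k]][ch] += 1

    result = {}
    for k in sorted(counts):
        total = sum(sum(c.values()) for c in counts[k].values())
        h = 0.0
        for c in counts[k].values():
            n = sum(c.values())  # occurrences of this prefix at position k
            for m in c.values():
                # -p(prefix, letter) * log2 p(letter | prefix)
                h -= (m / total) * math.log2(m / n)
        result[k + 1] = h
    return result


if __name__ == "__main__":
    sample = "the then cat dog"
    for pos, h in positional_entropy(sample).items():
        print(f"position {pos}: {h:.3f} bits")
```

On a realistic corpus, the entry for position 1 is the plain entropy of the first-letter distribution, while later entries measure how much uncertainty remains once the beginning of the word is known. Comparing these values gives a rough, within-word view of the effect the abstract describes.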
