Robustness of sentence length measures in written texts

Denner S. Vieira, Sergio Picoli, and Renio S. Mendes

arXiv:1805.01460·cs.CL·May 7, 2018

TL;DR

This study tests how robust six sentence length measures are by analyzing a corpus of about five hundred English books, and finds that the different measures yield similar structural insights.

Contribution

The paper systematically compares six sentence length measures across about five hundred books in English, demonstrating that they behave consistently and capture text structure robustly.

Findings

01

All six measures show high correlation and similar distribution patterns.

02

Sentence length measures exhibit consistent auto-correlation properties.

03

Different measures are interchangeable for analyzing text structure.

Abstract

Hidden structural patterns in written texts have been the subject of considerable research in recent decades. In particular, mapping a text into a time series of sentence lengths is a natural way to investigate text structure. Typically, sentence length has been quantified using measures based on the number of words and the number of characters, but other variations are possible. To quantify the robustness of different sentence length measures, we analyzed a database containing about five hundred books in English. For each book, we extracted six distinct measures of sentence length, including number of words and number of characters (taking into account lemmatization and stop-word removal). We compared these six measures for each book by using i) Pearson's coefficient to investigate linear correlations; ii) the Kolmogorov-Smirnov test to compare distributions; and iii) detrended…
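As a rough illustration of the first two comparisons named in the abstract, the sketch below maps a text to two sentence length series (words and characters), then computes Pearson's coefficient and a two-sample Kolmogorov-Smirnov statistic. It is not the authors' code. The function names, the naive regex sentence splitter, and the rescaling of each series by its mean before the distribution comparison are all assumptions made here; the lemmatized and stop-word-filtered variants are not shown.

```python
import math
import re
from bisect import bisect_right


def sentence_lengths(text):
    """Return (words, chars): two length series, one entry per sentence."""
    # Assumption: split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    chars = [len(s.replace(" ", "")) for s in sentences]  # spaces excluded
    return words, chars


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs + ys))
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in points
    )


def rescale(series):
    """Divide by the mean so series on different scales can be compared
    (an assumption of this sketch, not taken from the paper)."""
    mean = sum(series) / len(series)
    return [v / mean for v in series]


sample = "The cat sat. A much longer sentence follows here now! Is it short? Yes."
words, chars = sentence_lengths(sample)
r = pearson(words, chars)
d = ks_statistic(rescale(words), rescale(chars))
```

On a real book the same three calls would be applied to the full text, giving one correlation value and one KS statistic per pair of measures. A production pipeline would likely replace the regex splitter with a proper sentence tokenizer.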
