Quadratic Term Correction on Heaps' Law
Oscar Fontanelli, Wentian Li

TL;DR
This paper demonstrates that quadratic functions in log-log scale fit type-token data better than the straight-line power law of Heaps' law, revealing a slight concavity in the word-type versus word-token relationship across various texts.
Contribution
It introduces a quadratic correction to Heaps' law, providing a more accurate model for the type-token relationship in language data.
Findings
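Quadratic functions in log-log scale fit the type-token data of twenty English novels or writings (some translated into English) perfectly.
Regressions of log(type) on log(token) with a linear and a quadratic term consistently give a linear coefficient slightly larger than 1 and a quadratic coefficient around -0.02, indicating a slight concavity in log-log scale.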
A pseudo-variance explains the curvature observed in the data.
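The regression described in the findings can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `type_token_curve` and `fit_loglog` are hypothetical, the tokenization is left to the caller, and an ordinary least-squares polynomial fit via NumPy is assumed as the regression method.

```python
import numpy as np

def type_token_curve(tokens):
    """Return (n_tokens, n_types): the number of distinct words seen
    after each successive token of the text."""
    seen = set()
    types = []
    for tok in tokens:
        seen.add(tok)
        types.append(len(seen))
    n_tokens = np.arange(1, len(tokens) + 1)
    return n_tokens, np.array(types)

def fit_loglog(n_tokens, n_types, degree):
    """Least-squares polynomial fit of log(types) on log(tokens).

    degree=1 is the classical Heaps' law (straight line in log-log scale);
    degree=2 adds the quadratic correction term.
    Returns (coefficients, sum of squared residuals); coefficients are
    ordered from the highest power down, as in numpy.polyfit.
    """
    x = np.log(np.asarray(n_tokens, dtype=float))
    y = np.log(np.asarray(n_types, dtype=float))
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return coeffs, float(np.sum(residuals ** 2))
```

With `degree=2`, the first returned coefficient is the quadratic term and the second is the linear term, so a slightly negative first coefficient corresponds to the concavity reported above. Comparing the residual sums for `degree=1` and `degree=2` gives a rough sense of how much the quadratic term improves the fit.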
Abstract
Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we have shown, by twenty English novels or writings (some are translated from another language to English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and quadratic term consistently lead to a linear coefficient of slightly larger than 1, and a quadratic coefficient around -0.02. Using the "random drawing colored ball from the bag with replacement" model, we have shown that the curvature of the log-log scale is identical to a…
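The ball-drawing model mentioned in the abstract can be simulated to produce a synthetic type-token curve. The sketch below is an illustration under stated assumptions, not the paper's derivation: it assumes a finite set of colors with Zipf-like probabilities (the abstract does not specify the color distribution), and the helper names `zipf_probs`, `simulate_type_token`, and `expected_types` are hypothetical.

```python
import numpy as np

def zipf_probs(vocab_size, exponent=1.0):
    """ASSUMPTION: color i (rank i) is drawn with probability ~ i**(-exponent)."""
    ranks = np.arange(1, vocab_size + 1, dtype=float)
    probs = ranks ** (-exponent)
    return probs / probs.sum()

def simulate_type_token(probs, n_draws, seed=0):
    """Draw balls with replacement and count the distinct colors seen so far.

    Returns (n_tokens, n_types), where n_types[k] is the number of
    distinct colors among the first k + 1 draws.
    """
    rng = np.random.default_rng(seed)
    draws = rng.choice(len(probs), size=n_draws, replace=True, p=probs)
    # Index of the first occurrence of each color that appears at all.
    _, first_idx = np.unique(draws, return_index=True)
    is_new = np.zeros(n_draws, dtype=int)
    is_new[first_idx] = 1
    n_types = np.cumsum(is_new)
    n_tokens = np.arange(1, n_draws + 1)
    return n_tokens, n_types

def expected_types(probs, n_draws):
    """Expected number of distinct colors after n_draws independent draws:
    sum over colors of 1 - (1 - p_i)**n_draws."""
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(1.0 - (1.0 - probs) ** n_draws))
```

For example, `simulate_type_token(zipf_probs(10000), 50000)` yields a curve that can be plotted in log-log scale or passed to a quadratic regression to inspect its curvature; `expected_types` gives the corresponding expectation for a single draw count.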
Taxonomy
Topics: Authorship Attribution and Profiling · Data Analysis with R · Complex Systems and Time Series Analysis
