Estimating the Entropy of Linguistic Distributions

Aryaman Arora; Clara Meister; Ryan Cotterell

arXiv:2204.01469·cs.CL·April 6, 2022

TL;DR

This paper evaluates entropy estimators on linguistic data. In a replication of two recent information-theoretic linguistic studies, it finds evidence that the reported effect sizes were over-estimated because of over-reliance on poor entropy estimators, and it closes with concrete recommendations for future work.

Contribution

The paper gives a comprehensive empirical comparison of entropy estimators on linguistic distributions, a setting where such a study was previously missing, and turns the results into concrete recommendations.

Findings

01

In a replication of two recent information-theoretic linguistic studies, the reported effect sizes appear over-estimated because of over-reliance on poor entropy estimators.

02

Which estimator performs best depends on the underlying distribution and on how much data is available.

03

The paper ends with concrete recommendations for choosing an entropy estimator in linguistic research.

Abstract

Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying probability distribution that gives rise to these data. While entropy estimation is a well-studied problem in other fields, there is not yet a comprehensive exploration of the efficacy of entropy estimators for use with linguistic data. In this work, we fill this void, studying the empirical effectiveness of different entropy estimators for linguistic distributions. In a replication of two recent information-theoretic linguistic studies, we find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators. Finally, we end our paper with concrete recommendations for entropy estimation depending on distribution…
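To make the estimation problem concrete, the sketch below compares two standard textbook estimators on a toy sample: the plug-in (maximum-likelihood) estimate and the Miller-Madow bias correction. It is an illustration only, not the paper's experimental setup or its recommended estimator. The function names, the Zipfian toy distribution over 1,000 types, and the sample size of 100 are all assumptions chosen for the example.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate, in nats."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow correction (K - 1) / (2N),
    where K is the number of distinct types observed in the sample."""
    k_observed = len(set(samples))
    return plugin_entropy(samples) + (k_observed - 1) / (2 * len(samples))


def zipf_distribution(k, exponent=1.0):
    """Hypothetical toy distribution: p(r) proportional to 1 / r**exponent."""
    weights = [1.0 / (r ** exponent) for r in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def true_entropy(probs):
    """Exact Shannon entropy of a known distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    probs = zipf_distribution(1000)
    # Far fewer observations than types: the regime where estimation is hard.
    sample = rng.choices(range(len(probs)), weights=probs, k=100)
    print("true entropy:  ", true_entropy(probs))
    print("plug-in:       ", plugin_entropy(sample))
    print("Miller-Madow:  ", miller_madow_entropy(sample))
```

The plug-in estimate can never exceed the log of the sample size, so with 100 observations drawn from this distribution it necessarily falls below the true entropy. The Miller-Madow term pushes the estimate upward but does not guarantee that the gap is closed.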
