Outperforming Good-Turing: Preliminary Report
Amichai Painsky, Meir Feder

TL;DR
This paper proposes a novel probability estimation method for large alphabets that groups symbols with similar frequencies, significantly improving accuracy over traditional estimators like Good-Turing.
Contribution
It challenges the usual one-to-one correspondence between a symbol's sample frequency and its estimated probability: symbols with similar frequencies are assigned the same probability, which reduces the number of parameters and improves generalization.
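To make the grouping idea concrete, here is a minimal sketch in Python. It is not the authors' estimator: the fixed-width binning of counts, the function name `grouped_estimate`, and the `width` parameter are all assumptions chosen for illustration, and the sketch covers observed symbols only (it assigns no mass to unseen symbols and does not build an ensemble).

```python
from collections import Counter


def grouped_estimate(samples, width=2):
    """Illustrative only: symbols whose counts fall in the same bin share one probability.

    `width` is a hypothetical bin size on raw counts; the paper's actual
    notion of "similar" frequencies is not specified in this summary.
    """
    counts = Counter(samples)
    n = len(samples)

    # Group symbols by count bin: counts 1..width -> bin 0, and so on.
    bins = {}
    for symbol, count in counts.items():
        bins.setdefault((count - 1) // width, []).append(symbol)

    # Each bin keeps its total empirical mass, split evenly among its members,
    # so the estimate still sums to 1 over the observed symbols.
    estimate = {}
    for members in bins.values():
        mass = sum(counts[s] for s in members) / n
        for s in members:
            estimate[s] = mass / len(members)
    return estimate


if __name__ == "__main__":
    print(grouped_estimate(list("aaaabbbccd"), width=2))
```

With the sample `aaaabbbccd` and `width=2`, the symbols `a` (4 occurrences) and `b` (3) land in one bin and share its mass, as do `c` (2) and `d` (1). The estimate then has two free values instead of four, which is the parameter reduction described above.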
Findings
Up to 50% improvement in estimation accuracy
Ensemble of regulated estimators enhances performance
Method is publicly available for implementation
Abstract
Estimating a large alphabet probability distribution from a limited number of samples is a fundamental problem in machine learning and statistics. A variety of estimation schemes have been proposed over the years, mostly inspired by the early work of Laplace and the seminal contribution of Good and Turing. One of the basic assumptions shared by most commonly-used estimators is the unique correspondence between the symbol's sample frequency and its estimated probability. In this work we tackle this paradigmatic assumption; we claim that symbols with "similar" frequencies shall be assigned the same estimated probability value. This way we regulate the number of parameters and improve generalization. In this preliminary report we show that by applying an ensemble of such regulated estimators, we introduce a dramatic enhancement in the estimation accuracy (typically up to 50%), compared to…
Taxonomy
Topics: Fractal and DNA sequence analysis · Neural Networks and Applications · Algorithms and Data Compression
