Outperforming Good-Turing: Preliminary Report

Amichai Painsky; Meir Feder

arXiv:1807.02287·stat.ML·August 20, 2018

TL;DR

This paper proposes a novel probability estimation method for large alphabets that groups symbols with similar frequencies, significantly improving accuracy over traditional estimators like Good-Turing.

Contribution

The paper introduces a paradigm in which symbols with similar sample frequencies are assigned the same estimated probability, which reduces the number of parameters and improves estimation accuracy.
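The grouping idea can be illustrated with a minimal sketch. This is not the authors' algorithm (the paper describes an ensemble of regulated estimators, which is not reproduced here); it only shows what "same probability for similar frequencies" can look like. The function name `grouped_estimate` and the `bin_width` parameter are hypothetical, and the sketch assigns no mass to unseen symbols.

```python
from collections import Counter


def grouped_estimate(samples, bin_width=2):
    """Give one shared probability to all symbols whose counts fall in the same bin.

    Assumption: "similar" frequencies means counts in the same fixed-width bin.
    bin_width is a hypothetical knob, not a parameter taken from the paper.
    """
    counts = Counter(samples)
    n = sum(counts.values())

    # Group observed symbols by count bin: counts 1..bin_width go to bin 0, and so on.
    bins = {}
    for symbol, c in counts.items():
        bins.setdefault((c - 1) // bin_width, []).append(symbol)

    estimate = {}
    for members in bins.values():
        # Shared value: the mean empirical frequency of the bin's members,
        # so the estimates over observed symbols still sum to one.
        shared = sum(counts[s] for s in members) / (n * len(members))
        for s in members:
            estimate[s] = shared
    return estimate


est = grouped_estimate(list("aaaabbbcd"), bin_width=2)
```

With these samples, `a` (4 occurrences) and `b` (3 occurrences) land in the same bin and receive one shared value, so the estimator has fewer distinct parameters than the plain empirical frequencies.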

Findings

1. Up to 50% improvement in estimation accuracy
2. Ensemble of regulated estimators enhances performance
3. Method is publicly available for implementation

Abstract

Estimating a large alphabet probability distribution from a limited number of samples is a fundamental problem in machine learning and statistics. A variety of estimation schemes have been proposed over the years, mostly inspired by the early work of Laplace and the seminal contribution of Good and Turing. One of the basic assumptions shared by most commonly-used estimators is the unique correspondence between the symbol's sample frequency and its estimated probability. In this work we tackle this paradigmatic assumption; we claim that symbols with "similar" frequencies shall be assigned the same estimated probability value. This way we regulate the number of parameters and improve generalization. In this preliminary report we show that by applying an ensemble of such regulated estimators, we introduce a dramatic enhancement in the estimation accuracy (typically up to 50%), compared to…
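For context, the two classical baselines named in the abstract can be sketched as follows. In both, the estimate for a symbol depends only on its own count, which is the one-to-one correspondence the abstract questions. This is a textbook-style sketch, not code from the paper: the function names are hypothetical, and the fallback to the raw count when no symbol has count c + 1 is a simplifying assumption (practical Good-Turing implementations smooth the frequency-of-frequency values instead), so the Good-Turing output here is not guaranteed to be normalized.

```python
from collections import Counter


def laplace_estimate(samples, alphabet_size):
    """Add-one (Laplace) estimate for observed symbols: (c + 1) / (n + K)."""
    counts = Counter(samples)
    n = len(samples)
    return {s: (c + 1) / (n + alphabet_size) for s, c in counts.items()}


def good_turing_estimate(samples):
    """Unsmoothed Good-Turing: adjusted count c* = (c + 1) * N_{c+1} / N_c.

    Returns (per-symbol estimates, estimated total mass of unseen symbols).
    Assumption: when N_{c+1} is zero, fall back to the raw count c.
    """
    counts = Counter(samples)
    n = len(samples)
    # N_c: the number of distinct symbols that appear exactly c times.
    freq_of_freq = Counter(counts.values())
    missing_mass = freq_of_freq.get(1, 0) / n

    estimate = {}
    for s, c in counts.items():
        n_next = freq_of_freq.get(c + 1, 0)
        adjusted = (c + 1) * n_next / freq_of_freq[c] if n_next else c
        estimate[s] = adjusted / n
    return estimate, missing_mass


samples = list("aaabbcd")
laplace = laplace_estimate(samples, alphabet_size=26)
good_turing, unseen_mass = good_turing_estimate(samples)
```

Here `c` and `d` each appear once and receive the same Good-Turing value, while `unseen_mass` equals the fraction of samples that are singletons (2 of 7).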

Taxonomy

Topics: Fractal and DNA sequence analysis · Neural Networks and Applications · Algorithms and Data Compression