Optimal Data-Based Binning for Histograms
Kevin H. Knuth

TL;DR
This paper presents a straightforward, data-based method for selecting the optimal number of bins in a uniform bin-width histogram used as a density estimator.
Contribution
It introduces a Bayesian approach that assigns a multinomial likelihood and a non-informative prior to derive the posterior probability for the number of bins; the approach also applies to multi-dimensional histograms.
Findings
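The posterior probability for the number of bins, given the data, identifies the optimal bin count for a uniform bin-width histogram.
The paper examines the effects of small sample sizes and digitized data on the result.
It demonstrates that the method applies to multi-dimensional histograms.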
Abstract
Histograms are convenient non-parametric density estimators, which continue to be used ubiquitously. Summary quantities estimated from histogram-based probability density models depend on the choice of the number of bins. We introduce a straightforward data-based method of determining the optimal number of bins in a uniform bin-width histogram. By assigning a multinomial likelihood and a non-informative prior, we derive the posterior probability for the number of bins in a piecewise-constant density model given the data. In addition, we estimate the mean and standard deviations of the resulting bin heights, examine the effects of small sample sizes and digitized data, and demonstrate the application to multi-dimensional histograms.
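The abstract describes the method only in words, so the following is a minimal sketch of how the bin-count search might look, not the author's implementation. It assumes the closed-form relative log posterior usually quoted for this approach, log p(M | d) = N log M + log Γ(M/2) − M log Γ(1/2) − log Γ(N + M/2) + Σ_k log Γ(n_k + 1/2) + const, where N is the sample size and n_k is the count in bin k; that formula is not stated in the text above. The function names, the max_bins cutoff, and the choice of binning range (minimum to maximum of the data) are hypothetical choices made for illustration.

```python
import math
import random


def bin_counts(data, m):
    """Count samples in m equal-width bins spanning [min(data), max(data)]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / m
    counts = [0] * m
    for x in data:
        if width == 0:
            k = 0  # degenerate case: all samples are identical
        else:
            # Clamp so that the maximum value falls in the last bin.
            k = min(int((x - lo) / width), m - 1)
        counts[k] += 1
    return counts


def log_posterior(data, m):
    """Relative log posterior for m bins (assumed closed form, up to a constant)."""
    n = len(data)
    counts = bin_counts(data, m)
    return (
        n * math.log(m)
        + math.lgamma(m / 2)
        - m * math.lgamma(0.5)
        - math.lgamma(n + m / 2)
        + sum(math.lgamma(c + 0.5) for c in counts)
    )


def optimal_bins(data, max_bins=50):
    """Return the bin count in 1..max_bins with the highest log posterior."""
    return max(range(1, max_bins + 1), key=lambda m: log_posterior(data, m))


# Illustrative data only: 1000 draws from a standard normal distribution.
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
best_m = optimal_bins(sample)
print("bin count with highest log posterior:", best_m)
```

A brute-force scan over candidate bin counts is used here because the posterior is cheap to evaluate for one-dimensional data; for large samples or multi-dimensional histograms a smarter search would likely be preferable.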
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and Algorithms · Algorithms and Data Compression
