Optimal Data-Based Binning for Histograms

Kevin H. Knuth

arXiv:physics/0605197 · physics.data-an · September 17, 2013 · 137 cites

TL;DR

This paper presents a straightforward, data-based method for selecting the optimal number of bins in a uniform bin-width histogram, so that summary quantities estimated from the histogram density model depend less on an arbitrary choice of bin count.

Contribution

It introduces a Bayesian approach that derives the posterior probability for the number of bins from a multinomial likelihood and a non-informative prior, and that extends to multi-dimensional histograms.

Findings

01

The posterior probability for the number of bins, given the data, identifies the optimal bin count for a piecewise-constant density model.

02

The paper examines the effects of small sample sizes and digitized data.

03

The method is demonstrated on multi-dimensional histograms.

Abstract

Histograms are convenient non-parametric density estimators, which continue to be used ubiquitously. Summary quantities estimated from histogram-based probability density models depend on the choice of the number of bins. We introduce a straightforward data-based method of determining the optimal number of bins in a uniform bin-width histogram. By assigning a multinomial likelihood and a non-informative prior, we derive the posterior probability for the number of bins in a piecewise-constant density model given the data. In addition, we estimate the mean and standard deviations of the resulting bin heights, examine the effects of small sample sizes and digitized data, and demonstrate the application to multi-dimensional histograms.
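To make the procedure in the abstract concrete, here is a minimal one-dimensional sketch in Python. The abstract does not state the posterior itself, so the expression used below is the form commonly quoted from the full paper (relative log posterior, up to an additive constant) and should be checked against the paper before being relied on. The function names `log_posterior_bins` and `optimal_bins`, the brute-force search, and the `max_bins` limit are all illustrative assumptions, not part of the source.

```python
import math
import random


def log_posterior_bins(data, m):
    """Relative log posterior for m equal-width bins, up to an additive constant.

    Assumed form (multinomial likelihood, non-informative prior):
      N log M + lnG(M/2) - M lnG(1/2) - lnG(N + M/2) + sum_k lnG(n_k + 1/2)
    where lnG is the log-gamma function and n_k are the bin counts.
    """
    n = len(data)
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must span a non-zero range")
    width = (hi - lo) / m
    counts = [0] * m
    for x in data:
        # Clamp the index so the maximum value lands in the last bin.
        k = min(int((x - lo) / width), m - 1)
        counts[k] += 1
    return (
        n * math.log(m)
        + math.lgamma(m / 2)
        - m * math.lgamma(0.5)
        - math.lgamma(n + m / 2)
        + sum(math.lgamma(c + 0.5) for c in counts)
    )


def optimal_bins(data, max_bins=50):
    """Return the bin count in 1..max_bins with the highest log posterior."""
    return max(range(1, max_bins + 1), key=lambda m: log_posterior_bins(data, m))


# Usage: 1000 draws from a standard normal, with a fixed seed for repeatability.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
best_m = optimal_bins(sample)
```

The exhaustive search over `1..max_bins` is chosen for clarity, not speed. For real use, a maintained implementation such as `astropy.stats.knuth_bin_width` is likely a better starting point than this sketch.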

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and Algorithms · Algorithms and Data Compression