Better estimates from binned income data: Interpolated CDFs and mean-matching
Paul T. von Hippel, David J. Hunter, and McKalie Drown

TL;DR
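This paper introduces a fast, nonparametric method that interpolates the CDF to estimate income statistics from binned data. It is much faster and slightly more accurate than fitting parametric distributions, and both it and the midpoint approach can be constrained to reproduce the mean of income when that mean is known.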
Contribution
It proposes a novel nonparametric interpolation method for income data, improving accuracy and speed over parametric models, and provides implementations in R and Stata.
Findings
Interpolated CDFs are faster and slightly more accurate than parametric fits.
Constraining midpoints and interpolated CDFs to reproduce a known mean substantially improves estimates.
The approach performs well across all US counties in estimating Gini coefficients.
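The mean-matching idea for midpoints can be illustrated with a minimal sketch. The summary does not say how the constraint is imposed, so this sketch assumes one simple scheme: closed bins keep their midpoints, and the open-ended top bin is assigned whatever value makes the overall mean equal the known mean. The function names `midpoints_with_mean` and `gini_weighted` are hypothetical and are not taken from the authors' R or Stata implementations.

```python
import numpy as np

def midpoints_with_mean(lower, upper, counts, known_mean):
    """Representative income per bin (assumed scheme, not the authors' code).

    Closed bins get their midpoint. The last bin is treated as open-ended
    (its upper bound is ignored) and gets the value that makes the
    count-weighted mean equal known_mean.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    counts = np.asarray(counts, dtype=float)
    mids = (lower[:-1] + upper[:-1]) / 2.0
    total_income = known_mean * counts.sum()
    top_value = (total_income - (counts[:-1] * mids).sum()) / counts[-1]
    if top_value < lower[-1]:
        raise ValueError("known mean is too low to be consistent with the bins")
    return np.append(mids, top_value)

def gini_weighted(values, counts):
    """Gini coefficient of a discrete distribution with given counts."""
    values = np.asarray(values, dtype=float)
    w = np.asarray(counts, dtype=float)
    w = w / w.sum()
    mean = (w * values).sum()
    # Mean absolute difference over all weighted pairs, divided by 2 * mean.
    abs_diff = np.abs(np.subtract.outer(values, values))
    return (w[:, None] * w[None, :] * abs_diff).sum() / (2.0 * mean)
```

For example, with bins $0-10,000, $10,000-20,000, and $20,000+ holding 50, 30, and 20 incomes and a known mean of $20,000, the top bin is assigned $65,000, which restores the mean exactly.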
Abstract
Researchers often estimate income statistics from summaries that report the number of incomes in bins such as $0-10,000, $10,001-20,000,...,$200,000+. Some analysts assign incomes to bin midpoints, but this treats income as discrete. Other analysts fit a continuous parametric distribution, but the distribution may not fit well. We fit nonparametric continuous distributions that reproduce the bin counts perfectly by interpolating the cumulative distribution function (CDF). We also show how both midpoints and interpolated CDFs can be constrained to reproduce the mean of income when it is known. We compare the methods' accuracy in estimating the Gini coefficients of all 3,221 US counties. Fitting parametric distributions is very slow. Fitting interpolated CDFs is much faster and slightly more accurate. Both interpolated CDFs and midpoints give dramatically better estimates if…
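A minimal sketch of the interpolated-CDF idea described in the abstract: compute the empirical CDF at the bin edges, interpolate between those points, and read incomes off the inverse of the interpolated CDF. This sketch uses linear interpolation, which amounts to a uniform density within each bin, and it assumes the top bin has been given a finite upper bound and that every bin has a positive count. Both are simplifying assumptions for illustration, and the function names are hypothetical rather than the authors' API.

```python
import numpy as np

def interpolated_quantiles(edges, counts, n_grid=100_001):
    """Incomes at evenly spaced probabilities under a linearly interpolated CDF.

    edges  : bin boundaries, length len(counts) + 1, all finite (assumption:
             the open-ended top bin has been closed at some chosen bound).
    counts : number of incomes in each bin, all positive (assumption).
    """
    edges = np.asarray(edges, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # CDF evaluated at each bin edge; it reproduces the bin counts exactly.
    cdf = np.concatenate([[0.0], np.cumsum(counts) / counts.sum()])
    # Midpoints of n_grid equal-probability cells.
    p = (np.arange(n_grid) + 0.5) / n_grid
    # Inverting a piecewise-linear CDF is itself linear interpolation.
    return np.interp(p, cdf, edges)

def gini_from_quantiles(q):
    """Gini coefficient from equally weighted, sorted income values."""
    q = np.sort(np.asarray(q, dtype=float))
    n = q.size
    ranks = np.arange(1, n + 1)
    return ((2 * ranks - n - 1) * q).sum() / (n * q.sum())
```

Because the interpolated CDF passes through the cumulative bin shares at every edge, the share of simulated incomes falling in each bin matches the reported counts up to grid resolution. A different choice of interpolant or top-bin treatment would change the estimated Gini but not that property.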
