Partitioned Cross-Validation for Divide-and-Conquer Density Estimation
Anirban Bhattacharya, Jeffrey D. Hart

TL;DR
This paper introduces a partitioned cross-validation method for bandwidth selection in kernel density estimation. Compared with ordinary cross-validation, it reduces computational cost while maintaining statistical accuracy on large datasets, as demonstrated on a dataset with 11 million observations.
Contribution
The paper proposes a novel partitioned cross-validation approach for bandwidth selection in density estimation, with theoretical analysis and practical validation on large datasets.
Findings
Substantial computational gains over ordinary cross-validation
Statistical efficiency that can compare favorably with ordinary cross-validation
Effective application to datasets with millions of observations
Abstract
We present an efficient method to estimate cross-validation bandwidth parameters for kernel density estimation in very large datasets where ordinary cross-validation is rendered highly inefficient, both statistically and computationally. Our approach relies on calculating multiple cross-validation bandwidths on partitions of the data, followed by suitable scaling and averaging to return a partitioned cross-validation bandwidth for the entire dataset. The partitioned cross-validation approach produces substantial computational gains over ordinary cross-validation. We additionally show that partitioned cross-validation can be statistically efficient compared to ordinary cross-validation. We derive analytic expressions for the asymptotically optimal number of partitions and study its finite sample accuracy through a detailed simulation study. We additionally propose a permuted version of…
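The partition, scale, and average steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one-dimensional data, a Gaussian kernel, least-squares cross-validation on a fixed bandwidth grid, and the rescaling factor p^(-1/5) implied by the usual n^(-1/5) rate for second-order kernels. The function names (`lscv_score`, `cv_bandwidth`, `partitioned_cv_bandwidth`) are hypothetical, and the random shuffle only forms the partitions; it is not the permuted variant the abstract mentions.

```python
import numpy as np


def _normal_pdf(x, sd):
    """Density of N(0, sd^2) evaluated at x."""
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def lscv_score(x, h):
    """Least-squares CV criterion for a Gaussian-kernel KDE with bandwidth h."""
    n = x.size
    d = x[:, None] - x[None, :]  # all pairwise differences, O(n^2) memory
    # Integral of the squared estimate: Gaussian kernels convolve to sd = h*sqrt(2).
    term1 = _normal_pdf(d, h * np.sqrt(2.0)).sum() / n**2
    # Leave-one-out term: drop the n diagonal (i == j) contributions.
    off_diag = _normal_pdf(d, h).sum() - n * _normal_pdf(0.0, h)
    term2 = 2.0 * off_diag / (n * (n - 1))
    return term1 - term2


def cv_bandwidth(x, grid):
    """Grid value that minimizes the LSCV criterion."""
    scores = [lscv_score(x, h) for h in grid]
    return grid[int(np.argmin(scores))]


def partitioned_cv_bandwidth(x, n_parts, grid, seed=0):
    """Average per-partition CV bandwidths, rescaled to the full sample size."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(x), n_parts)
    h_parts = [cv_bandwidth(part, grid) for part in parts]
    # Assumption: h scales like n^(-1/5), so a bandwidth chosen for n/p points
    # is shrunk by p^(-1/5) to suit n points.
    return n_parts ** (-1.0 / 5.0) * float(np.mean(h_parts))


if __name__ == "__main__":
    data = np.random.default_rng(42).normal(size=2000)
    grid = np.geomspace(0.05, 1.5, 40)
    print(partitioned_cv_bandwidth(data, n_parts=4, grid=grid))
```

With this naive pairwise criterion, each grid evaluation on the full sample costs on the order of n^2 operations, whereas p partitions of size n/p cost about n^2/p in total, which gives a sense of where the computational saving comes from.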
Taxonomy
Topics: Statistical Methods and Inference · Gaussian Processes and Bayesian Inference · Bayesian Methods and Mixture Models
