Flash-SD-KDE: Accelerating SD-KDE with Tensor Cores
Elliot L. Epstein, Rajat Vadiraj Dwaraknath, John Winnicki

TL;DR
This paper introduces Flash-SD-KDE, a GPU implementation of score-debiased kernel density estimation (SD-KDE) that uses Tensor Cores to run up to 47x faster than a strong GPU baseline, making large-scale applications practical.
Contribution
The paper re-orders the SD-KDE computation to expose matrix-multiplication structure that Tensor Cores can accelerate, yielding large speedups over existing implementations and making large-scale SD-KDE feasible.
Findings
Up to 47x faster than a strong SD-KDE GPU baseline on a 32k-sample, 16-dimensional problem
3,300x faster than scikit-learn's KDE on the same problem
Completes a 1M-sample, 16-dimensional task with 131k queries in 2.3 seconds on a single GPU
Abstract
Score-debiased kernel density estimation (SD-KDE) achieves improved asymptotic convergence rates over classical KDE, but its use of an empirical score has made it significantly slower in practice. We show that by re-ordering the SD-KDE computation to expose matrix-multiplication structure, Tensor Cores can be used to accelerate the GPU implementation. On a 32k-sample 16-dimensional problem, our approach runs up to 47x faster than a strong SD-KDE GPU baseline and 3,300x faster than scikit-learn's KDE. On a larger 1M-sample 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in 2.3 s on a single GPU, making score-debiased density estimation practical at previously infeasible scales.
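The re-ordering mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' kernel. It assumes a Gaussian kernel, a single bandwidth h shared by the score estimate and the final KDE, and an SD-KDE form in which each sample is shifted by (h^2 / 2) times the empirical score before a standard KDE is evaluated; the function names are hypothetical. The point is only that the pairwise squared distances expand as ||q||^2 + ||x||^2 - 2 q·x and the score numerator as W X minus a row-scaled term, so the dominant costs become matrix products of the kind Tensor Cores are built for.

```python
import numpy as np

def gaussian_weights(queries, data, h):
    """Unnormalized Gaussian kernel weights, shape (m, n)."""
    # ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x ; the cross term is one matmul.
    q2 = np.sum(queries**2, axis=1)[:, None]
    x2 = np.sum(data**2, axis=1)[None, :]
    sq = np.maximum(q2 + x2 - 2.0 * (queries @ data.T), 0.0)  # clamp round-off
    return np.exp(-sq / (2.0 * h**2))

def empirical_score(points, data, h):
    """Gradient of the log of a Gaussian KDE at `points`, shape (m, d)."""
    w = gaussian_weights(points, data, h)
    denom = w.sum(axis=1, keepdims=True)
    # sum_j w_ij (x_j - p_i) = (W @ X)_i - denom_i * p_i ; a second matmul.
    return (w @ data - denom * points) / (h**2 * denom)

def sd_kde(queries, data, h):
    """Assumed SD-KDE form: shift samples along the score, then run a KDE."""
    n, d = data.shape
    shifted = data + 0.5 * h**2 * empirical_score(data, data, h)  # assumed step
    w = gaussian_weights(queries, shifted, h)
    return w.sum(axis=1) / (n * (2.0 * np.pi * h**2) ** (d / 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # samples
Q = rng.normal(size=(7, 3))    # query points
density = sd_kde(Q, X, h=0.5)
```

In a GPU setting the two matrix products (queries against data, and weights against data) would be the parts mapped to Tensor Cores, with the exponentials and row reductions fused around them; this sketch runs on the CPU and materializes the full weight matrix, which a large-scale implementation would avoid.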
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Advanced Data Storage Technologies
