Flash-SD-KDE: Accelerating SD-KDE with Tensor Cores

Elliot L. Epstein; Rajat Vadiraj Dwaraknath; John Winnicki

arXiv:2602.10378·cs.DC·February 12, 2026

Open Access

TL;DR

This paper introduces Flash-SD-KDE, a GPU implementation of score-debiased kernel density estimation (SD-KDE) that uses Tensor Cores to run up to 47x faster than a strong SD-KDE GPU baseline, making the method practical at large scale.

Contribution

The paper reorders the SD-KDE computation to expose its matrix-multiplication structure, which lets Tensor Cores accelerate the GPU implementation and makes large-scale SD-KDE feasible.

Findings

01

Up to 47x faster than a strong SD-KDE GPU baseline on a 32k-sample, 16-dimensional problem

02

3,300x faster than scikit-learn's KDE on a 32k-sample, 16-dimensional problem

03

Completes a 1M-sample, 16-dimensional task evaluated on 131k queries in 2.3 seconds on a single GPU

Abstract

Score-debiased kernel density estimation (SD-KDE) achieves improved asymptotic convergence rates over classical KDE, but its use of an empirical score has made it significantly slower in practice. We show that by re-ordering the SD-KDE computation to expose matrix-multiplication structure, Tensor Cores can be used to accelerate the GPU implementation. On a 32k-sample 16-dimensional problem, our approach runs up to $47\times$ faster than a strong SD-KDE GPU baseline and $3{,}300\times$ faster than scikit-learn's KDE. On a larger 1M-sample 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in $2.3$ s on a single GPU, making score-debiased density estimation practical at previously infeasible scales.
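To illustrate what "exposing matrix-multiplication structure" can look like, here is a minimal NumPy sketch. It is not the paper's kernel. It assumes a Gaussian kernel with a scalar bandwidth `h`, and the function names (`gaussian_weights`, `kde_density`, `kde_score`) are hypothetical. Pairwise squared distances are expanded as $\|p\|^2 + \|s\|^2 - 2\,p \cdot s$, so the dominant cost becomes one matrix product, and the empirical score of the Gaussian KDE reduces to a second matrix product of the kernel weights with the samples. The debiasing step that SD-KDE applies using this score is omitted.

```python
import numpy as np


def gaussian_weights(points, samples, h):
    """Unnormalized Gaussian kernel weights, shape (len(points), len(samples))."""
    p_sq = np.sum(points * points, axis=1)[:, None]    # (m, 1) squared norms
    s_sq = np.sum(samples * samples, axis=1)[None, :]  # (1, n) squared norms
    cross = points @ samples.T                         # (m, n) matrix product
    # ||p - s||^2 = ||p||^2 + ||s||^2 - 2 p.s; clip tiny negatives from rounding
    sq_dist = np.maximum(p_sq + s_sq - 2.0 * cross, 0.0)
    return np.exp(-sq_dist / (2.0 * h * h))


def kde_density(points, samples, h):
    """Gaussian KDE evaluated at `points`."""
    n, d = samples.shape
    norm = n * (2.0 * np.pi * h * h) ** (d / 2.0)
    return gaussian_weights(points, samples, h).sum(axis=1) / norm


def kde_score(points, samples, h):
    """Gradient of the log of the Gaussian KDE at `points`."""
    w = gaussian_weights(points, samples, h)
    # Kernel-weighted mean of the samples: a second matrix product.
    weighted_mean = (w @ samples) / w.sum(axis=1, keepdims=True)
    return (weighted_mean - points) / (h * h)


# Small illustrative problem (sizes chosen arbitrarily for the sketch).
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 4))
queries = rng.normal(size=(5, 4))
density = kde_density(queries, samples, h=0.7)
score = kde_score(queries, samples, h=0.7)
print(density.shape, score.shape)
```

On a GPU, products such as `points @ samples.T` and `w @ samples` are the kind of operation that Tensor Cores are built for. This sketch materializes the full weight matrix, which would not be feasible at the 1M-sample scale reported above, so a practical implementation would presumably need to process it in blocks.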

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Advanced Data Storage Technologies