Using Butterfly-Patterned Partial Sums to Optimize GPU Memory Accesses for Drawing from Discrete Distributions
Guy L. Steele Jr. (Oracle Labs), Jean-Baptiste Tristan (Oracle Labs)

TL;DR
This paper introduces a butterfly-patterned table of partial sums that speeds up sampling from discrete distributions on GPUs by making better use of coalesced memory accesses, more than doubling the speed of an already GPU-tuned LDA application when the number of clusters or topics K exceeds 200.
Contribution
The paper replaces the complete table of partial sums with a butterfly-patterned table that is faster to compute and makes better use of coalesced GPU memory accesses; complete partial sums are then computed on the fly during the binary search.
Findings
More than doubles the speed of an already GPU-tuned LDA application for K > 200 clusters or topics, as measured on an NVIDIA Titan Black GPU.
Uses butterfly-patterned tables for faster partial sum computation.
Improves GPU memory access efficiency during sampling.
Abstract
We describe a technique for drawing values from discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute, making better use of coalesced memory accesses. From this table, complete partial sums are computed on the fly during a binary search. Measurements using an NVIDIA Titan Black GPU show that for a sufficiently large number of clusters or topics (K > 200), this technique alone more than doubles the speed of a latent Dirichlet allocation (LDA) application already highly tuned for GPU execution.
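The abstract does not spell out the butterfly layout itself, so the following Python sketch illustrates only the general idea it describes: store an incomplete table of block sums, then reconstruct complete partial sums on the fly during the binary search. As a stand-in it uses a Fenwick (binary indexed) tree, which is not the authors' butterfly-patterned table and says nothing about GPU memory coalescing. The function names `build_block_sums`, `draw_index`, and `draw_index_reference` are hypothetical, chosen for this illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def build_block_sums(weights):
    """Build a Fenwick-style table (1-based): entry i holds the sum of the
    (i & -i) weights that end at position i, not the full prefix sum.
    Assumes a non-empty list of non-negative relative probabilities."""
    n = len(weights)
    table = [0] + list(weights)
    for i in range(1, n + 1):
        parent = i + (i & -i)          # next larger block that contains block i
        if parent <= n:
            table[parent] += table[i]
    return table


def draw_index(table, u):
    """Return the smallest 0-based index j whose complete partial sum
    w[0] + ... + w[j] exceeds u, for 0 <= u < sum(w).
    The complete partial sum is never stored; it is accounted for
    implicitly by subtracting each skipped block from u."""
    n = len(table) - 1
    pos = 0
    step = 1 << (n.bit_length() - 1)   # largest power of two <= n
    while step:
        nxt = pos + step
        if nxt <= n and table[nxt] <= u:
            pos = nxt                  # skip this whole block
            u -= table[nxt]            # and remove its mass from the target
        step >>= 1
    return min(pos, n - 1)             # guard against u == total from rounding


def draw_index_reference(weights, u):
    """Baseline for comparison: a complete table of partial sums
    followed by an ordinary binary search."""
    return bisect_right(list(accumulate(weights)), u)
```

To draw a sample, one would pick `u` uniformly in `[0, sum(weights))`, for example `random.random() * sum(weights)`, and call `draw_index(table, u)`. Both functions should agree on every `u` in that range; the difference is that the sketch never materializes the complete table of partial sums.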
Taxonomy
Topics: Bayesian Methods and Mixture Models · Gaussian Processes and Bayesian Inference · Music and Audio Processing
