Fast Histograms using Adaptive CUDA Streams
Sisir Koppaka, Dheevatsa Mudigere, Srihari Narasimhan, Babu Narayanan

TL;DR
This paper introduces an adaptive CUDA stream-based approach for fast histogram computation that minimizes atomic operation bottlenecks by dynamically switching kernels based on input stream characteristics.
Contribution
It proposes a novel adaptive kernel and stream model for CUDA that optimizes histogram computation by reducing atomic operation overhead and intelligently switching kernels.
Findings
Significant speedup over standard CUDA histogram implementations
Effective kernel switching reduces latency based on input stream degeneracy
Adaptive model improves throughput in high-speed data streams
Abstract
Histograms are widely used in medical imaging, network intrusion detection, packet analysis and other stream-based, high-throughput applications. However, when such software stacks are ported to the GPU, histogram computation is a typical bottleneck, primarily because atomic operations have a large impact on kernel speed. In this work, we propose a stream-based model implemented in CUDA, using a new adaptive kernel that can be optimized based on latency-hidden CPU compute. We also explore the tradeoffs of using the new kernel vis-à-vis the stock NVIDIA SDK kernel, and discuss an intelligent kernel switching method for the stream based on a degeneracy criterion that is adaptively computed from the input stream.
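The abstract does not define the degeneracy criterion, so the following is only an illustrative host-side sketch of how such a switch might look, not the authors' method. It assumes degeneracy can be approximated as the fraction of a sampled window that falls into its single most populated bin, and that a highly degenerate window (many threads hitting the same bin, hence heavy atomic contention) would favor the adaptive kernel. The names `degeneracy` and `choose_kernel`, the labels `"adaptive"` and `"stock"`, and the threshold of 0.5 are all hypothetical.

```python
from collections import Counter


def degeneracy(sample):
    """Assumed criterion: share of the sample in its most populated bin.

    `sample` is a list of bin indices taken from the input stream.
    Returns a value in [0, 1]; 1.0 means every value hits the same bin.
    """
    if not sample:
        return 0.0
    counts = Counter(sample)
    return max(counts.values()) / len(sample)


def choose_kernel(sample, threshold=0.5):
    """Pick a kernel label for the next chunk of the stream.

    The threshold and the direction of the switch are assumptions made
    for illustration; the paper's actual rule may differ.
    """
    return "adaptive" if degeneracy(sample) >= threshold else "stock"


if __name__ == "__main__":
    uniform_window = list(range(256))   # every bin hit once
    constant_window = [7] * 256         # one bin hit by every value
    print(choose_kernel(uniform_window))
    print(choose_kernel(constant_window))
```

In a streaming setup, an estimate like this could be computed on the CPU for one chunk while the GPU processes the previous one, which is one plausible reading of "latency-hidden CPU compute".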
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Error Correcting Code Techniques
