CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments
Huriyeh Babak, Melanie Schaller

TL;DR
This paper investigates CUDA kernel optimizations for depthwise convolution in cloud environments, introducing a counter-free performance analysis method that reveals architectural insights without hardware counters.
Contribution
It presents a controlled, operator-level study of CUDA kernel variants and a novel cloud-compatible analysis methodology for GPU performance profiling.
Findings
Warp-tiled kernel cuts runtime by a factor of 3.26 relative to the naive implementation
End-to-end training speedup of 1.29x achieved with optimized kernels
Analysis method enables architectural profiling without hardware counters
Abstract
Efficient GPU execution of convolution operators is governed by memory-access efficiency, on-chip data reuse, and execution mapping rather than arithmetic throughput alone. This paper presents a controlled operator-level study of CUDA kernel optimization for the depthwise convolution used in Structured State Space Model Convolutional Diagonal (S4ConvD), together with a cloud-compatible, counter-free performance analysis methodology. The operator, model, dataset, and training configuration are fixed, and only the CUDA kernel implementation is varied. The evaluated CUDA kernels comprise naive, global-memory-coalesced, shared-memory cache-blocked, and warp-tiled variants, covering forward, input-gradient, and weight-gradient execution paths under steady-state training conditions. Performance is characterized using a counter-free methodology that combines CUDA-event timing, execution-path…
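The three execution paths named in the abstract (forward, input-gradient, weight-gradient) can be illustrated with a small CPU reference. This is a minimal sketch, not the paper's kernels: it assumes a 1D depthwise convolution with causal (left) zero padding and a `[channel][time]` layout, and the function names are hypothetical. A reference of this kind is typically what a CUDA kernel variant would be checked against for correctness.

```python
# Pure-Python reference for a depthwise 1D convolution (ASSUMED form:
# causal zero padding, one private filter per channel, no channel mixing).
# Function names are illustrative and do not come from the paper.

def depthwise_conv1d_forward(x, w):
    """y[c][t] = sum_k w[c][k] * x[c][t - k], for t - k >= 0."""
    y = []
    for xc, wc in zip(x, w):  # each channel uses only its own filter
        yc = []
        for t in range(len(xc)):
            acc = 0.0
            for k in range(len(wc)):
                if t - k >= 0:
                    acc += wc[k] * xc[t - k]
            yc.append(acc)
        y.append(yc)
    return y


def depthwise_conv1d_input_grad(gy, w):
    """gx[c][s] = sum_k w[c][k] * gy[c][s + k], for s + k < L."""
    gx = []
    for gc, wc in zip(gy, w):
        length = len(gc)
        gx.append([
            sum(wc[k] * gc[s + k] for k in range(len(wc)) if s + k < length)
            for s in range(length)
        ])
    return gx


def depthwise_conv1d_weight_grad(gy, x, kernel_size):
    """gw[c][k] = sum_t gy[c][t] * x[c][t - k], for t >= k."""
    gw = []
    for gc, xc in zip(gy, x):
        length = len(xc)
        gw.append([
            sum(gc[t] * xc[t - k] for t in range(k, length))
            for k in range(kernel_size)
        ])
    return gw


# Two channels, sequence length 3, filter length 2.
x = [[1.0, 2.0, 3.0], [1.0, 0.0, -1.0]]
w = [[1.0, 0.5], [2.0, 1.0]]
y = depthwise_conv1d_forward(x, w)

# Upstream gradient of all ones, i.e. the loss is the sum of all outputs.
gy = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
gx = depthwise_conv1d_input_grad(gy, w)
gw = depthwise_conv1d_weight_grad(gy, x, kernel_size=2)
```

In this sketch the forward and input-gradient paths write one independent output per `(channel, time)` position, whereas the weight-gradient path reduces over the whole time axis for each filter tap. The paths therefore have different access and reduction patterns.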
