CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Huriyeh Babak; Melanie Schaller

arXiv:2604.25422 · cs.DC · April 30, 2026

TL;DR

This paper investigates CUDA kernel optimizations for depthwise convolution in cloud environments, introducing a counter-free performance analysis method that reveals architectural insights without hardware counters.

Contribution

It presents a controlled, operator-level study of CUDA kernel variants and a novel cloud-compatible analysis methodology for GPU performance profiling.

Findings

01 The warp-tiled kernel runs 3.26x faster than the naive implementation.

02 The optimized kernels yield a 1.29x end-to-end training speedup.

03 The analysis method enables architectural profiling without hardware counters.

Abstract

Efficient GPU execution of convolution operators is governed by memory-access efficiency, on-chip data reuse, and execution mapping rather than arithmetic throughput alone. This paper presents a controlled operator-level study of CUDA kernel optimization for the depthwise convolution used in Structured State Space Model Convolutional Diagonal (S4ConvD), together with a cloud-compatible, counter-free performance analysis methodology. The operator, model, dataset, and training configuration are fixed, and only the CUDA kernel implementation is varied. The evaluated CUDA kernels comprise naive, global-memory-coalesced, shared-memory cache-blocked, and warp-tiled variants, covering forward, input-gradient, and weight-gradient execution paths under steady-state training conditions. Performance is characterized using a counter-free methodology that combines CUDA-event timing, execution-path…
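For readers unfamiliar with the three execution paths named in the abstract, the following is a minimal CPU reference sketch in plain Python. It is not the paper's kernel code. It assumes a 1D depthwise convolution (one filter per channel, no cross-channel mixing) with "valid" padding and stride 1; the function names, the tensor layout (`[channels][length]`), and the padding choice are all illustrative assumptions, since the excerpt does not specify them.

```python
# Reference sketch of a 1D depthwise convolution and its two gradients.
# Assumptions (not from the paper): "valid" padding, stride 1,
# inputs stored as nested lists x[channel][time], w[channel][tap].

def depthwise_conv1d_forward(x, w):
    """Forward path: y[c][t] = sum_k w[c][k] * x[c][t + k]."""
    out_len = len(x[0]) - len(w[0]) + 1
    return [
        [sum(w_c[k] * x_c[t + k] for k in range(len(w_c))) for t in range(out_len)]
        for x_c, w_c in zip(x, w)
    ]

def depthwise_conv1d_input_grad(grad_y, w, in_len):
    """Input-gradient path: scatter each output gradient back to the inputs it read."""
    grad_x = [[0.0] * in_len for _ in w]
    for c, (g_c, w_c) in enumerate(zip(grad_y, w)):
        for t, g in enumerate(g_c):
            for k, w_k in enumerate(w_c):
                grad_x[c][t + k] += g * w_k
    return grad_x

def depthwise_conv1d_weight_grad(grad_y, x, k_len):
    """Weight-gradient path: grad_w[c][k] = sum_t grad_y[c][t] * x[c][t + k]."""
    return [
        [sum(g_c[t] * x_c[t + k] for t in range(len(g_c))) for k in range(k_len)]
        for g_c, x_c in zip(grad_y, x)
    ]

# Tiny usage example: 2 channels, length 4, 3-tap filters.
x = [[1, 2, 3, 4], [0, 1, 0, 1]]
w = [[1, 0, -1], [2, 2, 2]]
y = depthwise_conv1d_forward(x, w)
ones = [[1, 1], [1, 1]]  # stand-in upstream gradient
grad_x = depthwise_conv1d_input_grad(ones, w, in_len=4)
grad_w = depthwise_conv1d_weight_grad(ones, x, k_len=3)
```

Each channel is processed independently, so every multiply-add is paired with a memory read and there is little arithmetic to hide memory latency behind. That is consistent with the abstract's point that memory-access efficiency and on-chip reuse, rather than arithmetic throughput, govern performance for this operator.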
