Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers

Jędrzej Maczan

arXiv:2604.02344·cs.LG·April 6, 2026

TL;DR

This paper systematically characterizes WebGPU dispatch overhead for LLM inference at batch size 1 across four GPU vendors, three backends, and three browsers, separating the cost of the WebGPU API itself from the total per-operation overhead.

Contribution

The paper introduces a sequential-dispatch methodology for measuring WebGPU overhead and shows that, at batch size 1, this overhead matters more than kernel compute efficiency.
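The gap between a single-operation benchmark and a sequential-dispatch measurement can be illustrated with a small arithmetic sketch. It assumes, purely for illustration, that each timed run pays a fixed synchronization cost on top of the per-dispatch cost; this cost model, the function names, and the numbers (30 μs per dispatch, 570 μs per sync) are hypothetical and are not taken from the paper.

```python
# Illustrative cost model only: the mechanism and all numbers are assumptions,
# not measurements or methodology taken from the paper.

def naive_estimate(per_dispatch_us: float, sync_us: float) -> float:
    """Time one dispatch plus one sync and attribute everything to the dispatch."""
    return per_dispatch_us + sync_us


def sequential_estimate(per_dispatch_us: float, sync_us: float, n: int) -> float:
    """Time n back-to-back dispatches plus one final sync, then divide by n."""
    total_us = n * per_dispatch_us + sync_us
    return total_us / n


# Hypothetical inputs chosen so the effect is easy to see.
PER_DISPATCH_US = 30.0
SYNC_US = 570.0

naive = naive_estimate(PER_DISPATCH_US, SYNC_US)
seq = sequential_estimate(PER_DISPATCH_US, SYNC_US, n=1000)
ratio = naive / seq

print(f"naive: {naive:.2f} us, sequential: {seq:.2f} us, ratio: {ratio:.1f}x")
```

Under these assumed inputs, the fixed cost is spread over 1000 dispatches in the sequential case, so the per-dispatch estimate approaches the true 30 μs, while the single-operation estimate absorbs the whole sync cost.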

Findings

01

Per-dispatch overhead is 24-36 μs on Vulkan and 32-71 μs on Metal.

02

Kernel fusion improves throughput by 53% on Vulkan but provides no benefit on CUDA.

03

WebGPU achieves 11-12% of CUDA performance on the reference platform.

Abstract

WebGPU's security-focused design imposes per-operation validation that compounds across the many small dispatches in neural network inference, yet the true cost of this overhead is poorly characterized. We present a systematic characterization of WebGPU dispatch overhead for LLM inference at batch size 1, spanning four GPU vendors (NVIDIA, AMD, Apple, Intel), two native implementations (Dawn, wgpu-native) and three browsers (Chrome, Safari, Firefox), and two model sizes (Qwen2.5-0.5B and 1.5B). Our primary contribution is a sequential-dispatch methodology that reveals naive single-operation benchmarks overestimate dispatch cost by $\sim 20\times$. The true per-dispatch cost of WebGPU API overhead alone is 24-36 μs on Vulkan and 32-71 μs on Metal, while the total per-operation overhead including Python cost is $\sim 95$ μs, which turns out to be a distinction critical for…
