TL;DR
VDCores introduces a resource decoupling model for asynchronous GPUs, improving hardware utilization and decoding throughput while reducing programming effort, demonstrated on multiple GPU architectures.
Contribution
VDCores proposes a decoupled programming and execution abstraction for asynchronous GPUs, enabling better resource utilization and performance.
Findings
Average decoding throughput improved by 24% across workloads.
Up to 77% throughput increase under dynamic inputs.
Kernel programming effort reduced by 90%.
Abstract
Modern GPUs increasingly rely on specialized and asynchronous hardware units to deliver high performance. Yet these units are often underutilized because today's GPU software stacks still organize programming and execution around a monolithic kernel model that mismatches asynchronous hardware. To address this issue, Virtual Decoupled Engines (VDCores) presents a new decoupled programming and execution model for asynchronous GPUs. VDCores abstracts asynchronous hardware execution units as resource-isolated virtual cores and represents workloads as dependency-connected micro-operations (micro-ops). This abstraction removes static orchestration from the programmer, enables automatic overlap of memory and compute based on dependency and resource readiness, and thereby improves utilization of asynchronous hardware resources. Realizing such a decoupled abstraction efficiently on today's…
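The idea of overlapping memory and compute based on dependency and resource readiness can be illustrated with a toy scheduler. The sketch below is not the VDCores implementation or API: the `MicroOp` class, the `"mem"` and `"compute"` unit labels, the cycle counts, and the greedy policy are all assumptions made for illustration. Each micro-op names the virtual core it runs on and the micro-ops it depends on, and it starts as soon as its dependencies have finished and its core is free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    """Hypothetical micro-op: a unit of work bound to one virtual core."""
    name: str
    unit: str          # assumed core labels: "mem" or "compute"
    cycles: int        # made-up duration, for illustration only
    deps: tuple = ()   # names of micro-ops that must finish first


def schedule(ops):
    """Greedy toy scheduler. Returns {name: (start, end)} in cycles."""
    finish = {}      # name -> cycle at which the op completes
    unit_free = {}   # unit -> cycle at which the core becomes idle
    timeline = {}
    pending = list(ops)

    def earliest_start(op):
        deps_done = max((finish[d] for d in op.deps), default=0)
        return max(deps_done, unit_free.get(op.unit, 0))

    while pending:
        # An op is ready once all of its dependencies have been scheduled.
        ready = [op for op in pending if all(d in finish for d in op.deps)]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        op = min(ready, key=earliest_start)
        start = earliest_start(op)
        end = start + op.cycles
        timeline[op.name] = (start, end)
        finish[op.name] = end
        unit_free[op.unit] = end
        pending.remove(op)
    return timeline


def makespan(timeline):
    """Total cycles until the last micro-op completes."""
    return max(end for _, end in timeline.values())


# Two tiles: each is loaded, then multiplied; a final store waits on both.
demo_ops = [
    MicroOp("load0", "mem", 4),
    MicroOp("mma0", "compute", 4, deps=("load0",)),
    MicroOp("load1", "mem", 4),
    MicroOp("mma1", "compute", 4, deps=("load1",)),
    MicroOp("store", "mem", 2, deps=("mma0", "mma1")),
]

demo_timeline = schedule(demo_ops)
serial_cycles = sum(op.cycles for op in demo_ops)

if __name__ == "__main__":
    for name, (start, end) in demo_timeline.items():
        print(f"{name:6s} cycles {start:2d}-{end:2d}")
    print("decoupled makespan:", makespan(demo_timeline))
    print("serialized total:  ", serial_cycles)
```

In this toy example the second load runs on the memory core while the first multiply occupies the compute core, so the schedule finishes in 14 cycles instead of the 18 needed when the same micro-ops run back to back. No overlap was written by hand: it falls out of the dependency edges and per-core availability.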
