Characterization of GPU TEE Overheads in Distributed Data Parallel ML Training
Jonghyun Lee, Yongqin Wang, Rachit Rajat, Murali Annavaram

TL;DR
This paper thoroughly examines the performance overhead introduced by GPU trusted execution environments in distributed data parallel machine learning training, revealing significant runtime increases due to encryption and authentication during inter-GPU communication.
Contribution
It provides the first detailed characterization of GPU TEE overheads in distributed ML training, highlighting the impact of secure communication on performance.
Findings
Runtime per iteration increases by up to 41.6x with GPU TEEs.
Overheads grow with the number of GPUs and model size.
Secure communication costs dominate the performance impact.
Abstract
Confidential computing (CC) or trusted execution enclaves (TEEs) is now the most common approach to enable secure computing in the cloud. The recent introduction of GPU TEEs by NVIDIA enables machine learning (ML) models to be trained without leaking model weights or data to the cloud provider. However, the potential performance implications of using GPU TEEs for ML training are not well characterized. In this work, we present an in-depth characterization study of the performance overhead associated with running distributed data parallel (DDP) ML training with GPU trusted execution environments (TEEs). Our study reveals the performance challenges of DDP training within GPU TEEs. DDP uses ring all-reduce, a well-known approach, to aggregate gradients from multiple devices. Ring all-reduce consists of multiple scatter-reduce and all-gather operations. In GPU TEEs, only the GPU package (GPU…
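To make the ring all-reduce structure described in the abstract concrete, here is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not the authors' code or any library's implementation: devices are plain Python lists, the function name `ring_all_reduce` is hypothetical, and the gradient length is assumed to divide evenly by the number of devices. It performs the scatter-reduce and all-gather phases and counts chunk transfers between neighbouring devices, since each such inter-GPU transfer is where the page says encryption and authentication costs arise.

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce (sum) over per-device gradient lists.

    Hypothetical helper for illustration only.
    Returns (per-device reduced gradients, total chunk transfers).
    """
    n = len(grads)
    length = len(grads[0])
    assert n >= 1 and length % n == 0, "assumes length divisible by device count"
    chunk = length // n

    # bufs[d][c] is chunk c as currently held by device d.
    bufs = [[list(g[c * chunk:(c + 1) * chunk]) for c in range(n)] for g in grads]
    transfers = 0

    # Phase 1: scatter-reduce, n - 1 steps.
    # In step s, device d sends chunk (d - s) mod n to its right neighbour,
    # which adds it into its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot payloads first so all sends in a step happen "simultaneously".
        sends = [((d - step) % n, list(bufs[d][(d - step) % n])) for d in range(n)]
        for d, (c, payload) in enumerate(sends):
            dst = (d + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], payload)]
            transfers += 1

    # Now device d holds the fully reduced chunk (d + 1) mod n.
    # Phase 2: all-gather, n - 1 steps.
    # In step s, device d forwards chunk (d + 1 - s) mod n; the receiver overwrites.
    for step in range(n - 1):
        sends = [((d + 1 - step) % n, list(bufs[d][(d + 1 - step) % n])) for d in range(n)]
        for d, (c, payload) in enumerate(sends):
            dst = (d + 1) % n
            bufs[dst][c] = payload
            transfers += 1

    result = [[x for c in range(n) for x in bufs[d][c]] for d in range(n)]
    return result, transfers


if __name__ == "__main__":
    reduced, count = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
    print(reduced[0], count)
```

In this sketch each device sends 2(N - 1) chunks per all-reduce, so the total transfer count is 2N(N - 1). If every transfer must be encrypted and authenticated, as the TL;DR describes for GPU TEEs, the number of protected messages rises with the GPU count, which is consistent with the finding that overheads grow with the number of GPUs. The sketch does not model the actual cost of those operations.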
Taxonomy
Topics: Distributed and Parallel Computing Systems
