Characterization of GPU TEE Overheads in Distributed Data Parallel ML Training

Jonghyun Lee; Yongqin Wang; Rachit Rajat; Murali Annavaram

arXiv:2501.11771 · cs.CR · August 15, 2025

Open Access

TL;DR

This paper thoroughly examines the performance overhead introduced by GPU trusted execution environments in distributed data parallel machine learning training, revealing significant runtime increases due to encryption and authentication during inter-GPU communication.

Contribution

It provides the first detailed characterization of GPU TEE overheads in distributed ML training, highlighting the impact of secure communication on performance.

Findings

01. Runtime per iteration increases by up to 41.6x with GPU TEEs.

02. Overheads grow with the number of GPUs and model size.

03. Secure communication costs dominate the performance impact.

Abstract

Confidential computing (CC) with trusted execution environments (TEEs) is now the most common approach to enabling secure computing in the cloud. The recent introduction of GPU TEEs by NVIDIA enables machine learning (ML) models to be trained without leaking model weights or data to the cloud provider. However, the performance implications of using GPU TEEs for ML training are not well characterized. In this work, we present an in-depth characterization of the performance overhead of running distributed data parallel (DDP) ML training with GPU TEEs. Our study reveals the performance challenges of DDP training within GPU TEEs. DDP uses ring all-reduce, a well-known approach, to aggregate gradients from multiple devices. Ring all-reduce consists of multiple scatter-reduce and all-gather operations. In GPU TEEs only the GPU package (GPU…
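To make the ring all-reduce structure mentioned in the abstract concrete, here is a minimal pure-Python sketch of its two phases (scatter-reduce, then all-gather). It is an illustration under stated assumptions, not the paper's implementation or any library's API: the function name `ring_all_reduce`, the list-based "workers", and the `sends` counter are all hypothetical, and the gradient length is assumed to divide evenly by the number of workers. The counter tallies point-to-point chunk transfers, which are the inter-GPU communication events where, per the summary above, encryption and authentication costs would arise in a GPU TEE.

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce (sum) over per-worker gradient lists.

    Hypothetical helper for illustration only. Returns (reduced, sends):
    `reduced[w]` is worker w's final buffer, and `sends` counts
    point-to-point chunk transfers across all workers.
    """
    n = len(grads)
    length = len(grads[0])
    # Assumption of this sketch: the gradient splits evenly into n chunks.
    assert length % n == 0, "gradient length must be divisible by worker count"
    size = length // n
    # bufs[w][c] is worker w's local copy of chunk c.
    bufs = [[list(g[c * size:(c + 1) * size]) for c in range(n)] for g in grads]
    sends = 0

    # Phase 1, scatter-reduce: n-1 steps. In step s, worker w sends chunk
    # (w - s) mod n to its ring neighbor, which adds it to its own copy.
    for s in range(n - 1):
        # Snapshot payloads first so all sends in a step are simultaneous.
        outgoing = [(w, (w - s) % n, list(bufs[w][(w - s) % n])) for w in range(n)]
        for w, c, payload in outgoing:
            dst = (w + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], payload)]
            sends += 1

    # Worker w now holds the fully reduced chunk (w + 1) mod n.
    # Phase 2, all-gather: n-1 steps. In step s, worker w forwards chunk
    # (w + 1 - s) mod n, and the neighbor overwrites its copy.
    for s in range(n - 1):
        outgoing = [(w, (w + 1 - s) % n, list(bufs[w][(w + 1 - s) % n])) for w in range(n)]
        for w, c, payload in outgoing:
            dst = (w + 1) % n
            bufs[dst][c] = payload
            sends += 1

    reduced = [[x for chunk in b for x in chunk] for b in bufs]
    return reduced, sends


# Usage: 4 simulated workers, each holding an 8-element gradient.
grads = [[w + 1] * 8 for w in range(4)]
reduced, sends = ring_all_reduce(grads)
```

In this sketch each worker performs 2(n-1) sends, so the total number of transfers is 2n(n-1). The number of separately transferred chunks therefore grows with the number of workers, which is in line with the finding that overheads grow with GPU count, although the sketch does not model any actual encryption cost.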

Taxonomy

Topics: Distributed and Parallel Computing Systems