Confidential Computing on NVIDIA Hopper GPUs: A Performance Benchmark Study
Jianwei Zhu, Hang Yin, Peng Deng, Aline Almeida, Shunfan Zhou

TL;DR
This study benchmarks the performance impact of enabling Trusted Execution Environment (TEE) mode on NVIDIA Hopper GPUs during large language model (LLM) inference. It finds that CPU-GPU data transfer over PCIe is the main bottleneck, while computation inside the GPU incurs minimal overhead.
Contribution
It provides the first detailed performance analysis of TEE mode on NVIDIA Hopper GPUs for LLM inference, identifying CPU-GPU data transfer as the primary source of overhead.
Findings
TEE-mode overhead remains below 7% for the majority of typical LLM queries.
Larger models and longer sequences experience nearly zero overhead.
CPU-GPU data transfer over PCIe is the main bottleneck.
Abstract
This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on NVIDIA Hopper GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various LLMs and token lengths, with a particular focus on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results indicate that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily attributable to data transfer. For the majority of typical LLM queries, the overhead remains below 7%, with larger models and longer sequences experiencing nearly zero overhead.
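To make the overhead figures concrete, here is a minimal sketch of how a relative TEE overhead could be computed, together with a toy latency model of why longer-running requests would see less of it. Both functions, their names, and the model itself (compute time unchanged, only transfer time multiplied by a hypothetical `transfer_penalty`) are illustrative assumptions, not the authors' measurement method, and no values here come from the paper.

```python
def tee_overhead_pct(throughput_off: float, throughput_on: float) -> float:
    """Relative throughput loss, in percent, when TEE mode is enabled.

    Assumption: overhead is defined against the TEE-off baseline.
    """
    if throughput_off <= 0:
        raise ValueError("baseline throughput must be positive")
    return (throughput_off - throughput_on) / throughput_off * 100.0


def modeled_overhead_pct(compute_s: float, transfer_s: float,
                         transfer_penalty: float) -> float:
    """Latency overhead, in percent, under a toy model.

    Assumption: GPU compute time is unaffected by TEE mode, and only the
    CPU-GPU transfer time is scaled by `transfer_penalty` (>= 1.0).
    """
    base = compute_s + transfer_s
    if base <= 0:
        raise ValueError("total baseline time must be positive")
    with_tee = compute_s + transfer_s * transfer_penalty
    return (with_tee - base) / base * 100.0


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the trend, not measured data.
    short_request = modeled_overhead_pct(1.0, 0.1, 2.0)
    long_request = modeled_overhead_pct(10.0, 0.1, 2.0)
    print(f"short request: {short_request:.2f}%")
    print(f"long request:  {long_request:.2f}%")
```

Under this model the transfer cost stays fixed while compute time grows with model size and sequence length, so the relative overhead shrinks, which is consistent with the trend the abstract reports.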
Taxonomy
Topics: Chaos-based Image/Signal Encryption
Methods: Focus
