Memory-Efficient Deep Learning Inference in Trusted Execution Environments
Jean-Baptiste Truong, William Gallagher, Tian Guo, Robert J. Walls

TL;DR
This paper presents techniques to improve deep neural network inference in trusted execution environments by reducing memory bottlenecks and latency through novel partitioning and compression methods.
Contribution
The paper introduces y-plane partitioning, which gives convolutional layers consistent execution time and a smaller memory footprint, and applies quantization and compression to the large weight matrices of fully-connected layers.
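To make the quantization-plus-compression idea concrete, here is a minimal sketch. It is not the authors' implementation: the 8-bit linear quantizer, the use of zlib, and the function names quantize_and_compress and decompress_and_dequantize are all assumptions chosen for illustration. It only shows why a quantized, compressed weight blob is smaller than the raw weights that would otherwise have to be decrypted.

```python
import random
import zlib


def quantize_and_compress(weights):
    """Map floats to 8-bit codes with a linear scale, then deflate the codes.

    Hypothetical helper: the summary does not specify the quantizer or codec.
    """
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    codes = bytes(round((w - w_min) / scale) for w in weights)
    return zlib.compress(codes, 9), scale, w_min


def decompress_and_dequantize(blob, scale, w_min):
    """Inflate the codes and map them back to approximate float weights."""
    return [code * scale + w_min for code in zlib.decompress(blob)]


random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(4096)]  # stand-in for one FC layer

blob, scale, w_min = quantize_and_compress(weights)
restored = decompress_and_dequantize(blob, scale, w_min)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
float32_bytes = 4 * len(weights)  # size of the same weights stored as float32
print(f"compressed: {len(blob)} bytes vs {float32_bytes} bytes as float32")
print(f"max reconstruction error: {max_error:.6f} (step size {scale:.6f})")
```

The reconstruction error is bounded by half a quantization step, so the trade-off is a small, predictable loss of precision in exchange for fewer bytes to decrypt.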
Findings
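With the proposed optimizations, latency overhead ranged from 1.09X to 2X the baseline across a wide range of TEE sizes
An unmodified implementation incurred latencies of up to 26X when running inside the TEE
Y-plane partitioning significantly reduces the memory footprint of convolutional layers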
Abstract
This study identifies and proposes techniques to alleviate two key bottlenecks to executing deep neural networks in trusted execution environments (TEEs): page thrashing during the execution of convolutional layers and the decryption of large weight matrices in fully-connected layers. For the former, we propose a novel partitioning scheme, y-plane partitioning, designed to (i) provide consistent execution time when the layer output is large compared to the TEE secure memory; and (ii) significantly reduce the memory footprint of convolutional layers. For the latter, we leverage quantization and compression. In our evaluation, the proposed optimizations incurred latency overheads ranging from 1.09X to 2X baseline for a wide range of TEE sizes; in contrast, an unmodified implementation incurred latencies of up to 26X when running inside of the TEE.
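The abstract does not spell out how y-plane partitioning works, so the sketch below only illustrates the general idea of splitting a convolution along the y axis. It is an assumption, not the paper's scheme: a single-channel "valid" convolution is computed one horizontal band at a time, so only the input rows a band needs have to be resident at once. The names conv2d_valid, conv2d_row_partitioned, and rows_per_part are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation on nested lists (single channel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def conv2d_row_partitioned(image, kernel, rows_per_part):
    """Compute the same output in bands of at most rows_per_part output rows."""
    kh = len(kernel)
    out_h = len(image) - kh + 1
    output = []
    for y0 in range(0, out_h, rows_per_part):
        y1 = min(y0 + rows_per_part, out_h)
        band = image[y0 : y1 + kh - 1]  # only the input rows this band needs
        output.extend(conv2d_valid(band, kernel))
    return output


# Small integer example so the comparison below is exact.
image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(8)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

full = conv2d_valid(image, kernel)
banded = conv2d_row_partitioned(image, kernel, rows_per_part=2)
print(full == banded)
```

Each band touches rows_per_part + kernel height - 1 input rows, so the working set is governed by the band size rather than by the full layer output, which is the property that matters when secure memory is small.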
