AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
Zhaoting Gong, Ran Ran, Fan Yao, and Wujie Wen

TL;DR
AEGIS is a system for scalable long-sequence encrypted Transformer inference on multi-GPU systems. It cuts inter-GPU communication by up to 81.3% and per-device memory by 69.1%, while reaching 96.62% scaling efficiency on four GPUs.
Contribution
AEGIS derives device placement from ciphertext dependencies, which reduces inter-GPU communication and improves the scalability of multi-GPU homomorphic Transformer inference.
Findings
Reduces inter-GPU communication by up to 81.3%
Achieves 96.62% scaling efficiency on four GPUs
Attains 3.86x end-to-end speedup and 69.1% per-device memory reduction
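As a quick sanity check on how the scaling and speedup figures relate, the sketch below uses the conventional strong-scaling definition (efficiency = speedup / number of devices). This page does not state which definition the authors use, nor whether both figures come from the same experiment, so treat the definition and the function names as assumptions.

```python
def scaling_efficiency(speedup: float, num_devices: int) -> float:
    """Strong-scaling efficiency: achieved speedup over ideal linear speedup.

    Assumed definition; the summary above does not spell out the authors' metric.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return speedup / num_devices


def speedup_from_efficiency(efficiency: float, num_devices: int) -> float:
    """Inverse of scaling_efficiency: speedup implied by a given efficiency."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return efficiency * num_devices


# Speedup implied by 96.62% efficiency on four GPUs under this definition.
implied = speedup_from_efficiency(0.9662, 4)
print(f"{implied:.2f}x")  # prints "3.86x"
```

Under this assumed definition the two reported numbers are mutually consistent: 0.9662 × 4 ≈ 3.86.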
Abstract
Fully Homomorphic Encryption (FHE) enables privacy-preserving Transformer inference, but long-sequence encrypted Transformers quickly exceed single-GPU memory capacity because encoded weights are already large and encrypted activations grow rapidly with sequence length. Multi-GPU execution therefore becomes unavoidable, yet scaling remains challenging because communication is jointly induced by application-level aggregation and encryption-level RNS coupling. Existing approaches either synchronize between devices frequently or replicate encrypted tensors across devices, leading to excessive communication and latency. We present AEGIS, an Application-Encryption Guided Inference System for scalable long-sequence encrypted Transformer inference on multi-GPU platforms. AEGIS derives device placement from ciphertext dependencies jointly induced by Transformer dataflow and CKKS polynomial…
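The abstract refers to "RNS coupling" without defining it here. The toy sketch below illustrates the general idea of a residue number system (RNS), not AEGIS's placement algorithm: a large value is stored as independent residues ("limbs") modulo pairwise-coprime moduli. Additions and multiplications act on each limb separately, so limbs could live on different devices, whereas any step that recombines limbs (here, CRT reconstruction) needs data from all of them. The moduli and function names are hypothetical, and real CKKS implementations operate on polynomials with far larger primes.

```python
from math import prod

# Toy pairwise-coprime moduli (hypothetical; real CKKS primes are much larger).
MODULI = [97, 101, 103, 107]


def to_rns(x, moduli=MODULI):
    """Split an integer into one residue (limb) per modulus."""
    return [x % q for q in moduli]


def rns_add(a, b, moduli=MODULI):
    """Limb-wise addition: each limb is computed independently."""
    return [(ai + bi) % q for ai, bi, q in zip(a, b, moduli)]


def rns_mul(a, b, moduli=MODULI):
    """Limb-wise multiplication: again no cross-limb data is needed."""
    return [(ai * bi) % q for ai, bi, q in zip(a, b, moduli)]


def from_rns(residues, moduli=MODULI):
    """CRT reconstruction: every limb contributes, so all limbs must be gathered."""
    big_q = prod(moduli)
    total = 0
    for r, q in zip(residues, moduli):
        q_hat = big_q // q                      # product of all other moduli
        total += r * q_hat * pow(q_hat, -1, q)  # modular inverse (Python 3.8+)
    return total % big_q


x, y = 1234, 5678
product_limbs = rns_mul(to_rns(x), to_rns(y))   # parallel across limbs
assert from_rns(product_limbs) == (x * y) % prod(MODULI)
```

If each limb were placed on a different GPU, `rns_add` and `rns_mul` would need no communication, while `from_rns` would force a gather. Cross-limb steps of this kind are one plausible reading of the encryption-level coupling the abstract describes.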
