SecMoE: Communication-Efficient Secure MoE Inference via Select-Then-Compute
Bowen Shen, Yuyue Chen, Peng Yang, Bin Zhang, Xi Zhang, Zoe L. Jiang

TL;DR
SecMoE is a communication-efficient, privacy-preserving framework for two-party MoE inference that supports substantially larger models while reducing communication overhead.
Contribution
It proposes a novel Select-Then-Compute approach that enhances privacy and efficiency in secure MoE inference, enabling larger models with less communication and computation.
Findings
Scales to 63× larger models with only 15.2× runtime increase.
Reduces communication by 1.8× to 7.1× compared to SOTA.
Achieves 1.3× to 3.8× speedup over existing protocols.
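One way to read the first finding is as a ratio of model growth to runtime growth. The short sketch below only divides the two reported figures; the resulting ratio is a derived illustration, not a number stated in the paper, and the variable names are assumptions.

```python
# Reported in the findings above: 63x larger models, 15.2x runtime increase.
model_scale = 63.0
runtime_increase = 15.2

# Derived illustration: model growth per unit of runtime growth.
ratio = model_scale / runtime_increase
print(f"{ratio:.2f}")  # 4.14
```

Under this reading, runtime grows roughly four times more slowly than model size.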
Abstract
Privacy-preserving Transformer inference has gained attention due to the potential leakage of private information. Despite recent progress, existing frameworks still fall short of practical model scales, with gaps of up to a hundredfold. A possible way to close this gap is the Mixture of Experts (MoE) architecture, which has emerged as a promising technique to scale up model capacity with minimal overhead. However, current secure two-party computation (2PC) protocols allow the server to homomorphically compute the FFN layer with its plaintext model weights; under the MoE setting, this could reveal to the server which expert is activated, exposing token-level information about the client's input. While naively evaluating all the experts before selection could protect privacy, it nullifies MoE sparsity and incurs the heavy computational overhead that sparse MoE seeks to avoid. To address the…
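The trade-off described in the abstract can be sketched in plaintext. The example below is a minimal illustration, not the SecMoE protocol: it uses hypothetical random gating scores and illustrative sizes (8 experts, top-2 routing, 4 tokens) to show that the selected expert indices depend on each token, and that evaluating every expert multiplies the number of expert FFN evaluations.

```python
import random

def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts (plaintext gating)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def expert_evaluations(num_tokens, num_experts, k, evaluate_all):
    """Count expert FFN evaluations for a batch of tokens."""
    per_token = num_experts if evaluate_all else k
    return num_tokens * per_token

random.seed(0)
num_experts, k, num_tokens = 8, 2, 4  # illustrative values, not from the paper

# Hypothetical gating scores, one row per token.
gate_scores = [[random.random() for _ in range(num_experts)]
               for _ in range(num_tokens)]

# The chosen indices vary with the token; a server that observes them
# learns something about the client's input.
selected = [route_top_k(row, k) for row in gate_scores]

sparse_cost = expert_evaluations(num_tokens, num_experts, k, evaluate_all=False)
dense_cost = expert_evaluations(num_tokens, num_experts, k, evaluate_all=True)

print(selected)
print(sparse_cost, dense_cost)  # 8 32
```

With these assumed sizes, hiding the routing by evaluating all experts costs four times as many expert evaluations as sparse routing, which is the overhead the abstract says sparse MoE is meant to avoid.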
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Adversarial Robustness in Machine Learning
