Ditto: Quantization-aware Secure Inference of Transformers upon MPC
Haoqi Wu, Wenjing Fang, Yancheng Zheng, Junming Ma, Jin Tan, Yinggui Wang, Lei Wang

TL;DR
Ditto introduces a quantization-aware framework for secure Transformer inference using MPC, significantly reducing computation and communication overhead while maintaining model utility, demonstrated on BERT and GPT-2 models.
Contribution
Ditto integrates quantization-aware techniques into MPC-based secure inference for Transformers, adding novel MPC primitives for type conversion and a quantization-aware distillation procedure to preserve accuracy.
Findings
Ditto is 3.14 to 4.40 times faster than MPCFormer.
Ditto is 1.44 to 2.35 times faster than PUMA.
Achieves negligible utility degradation.
Abstract
Due to the rising privacy concerns on sensitive client data and trained models like Transformers, secure multi-party computation (MPC) techniques are employed to enable secure inference despite attendant overhead. Existing works attempt to reduce the overhead using more MPC-friendly non-linear function approximations. However, the integration of quantization widely used in plaintext inference into the MPC domain remains unclear. To bridge this gap, we propose the framework named Ditto to enable more efficient quantization-aware secure Transformer inference. Concretely, we first incorporate an MPC-friendly quantization into Transformer inference and employ a quantization-aware distillation procedure to maintain the model utility. Then, we propose novel MPC primitives to support the type conversions that are essential in quantization and implement the quantization-aware MPC execution of…
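To make the role of type conversion concrete, here is a minimal plaintext sketch of fixed-point encoding and of moving a value between a wider, higher-precision representation and a narrower, lower-precision one. It is not Ditto's protocol: there is no secret sharing here, and the function names, ring sizes, and fractional bit counts are all illustrative assumptions. In an actual MPC setting the same conversion has to be carried out on secret shares, which is the part the paper's primitives address.

```python
# Plaintext illustration only: no secret sharing and no MPC protocol.
# All names, ring sizes, and fractional precisions are hypothetical.

def encode(x: float, frac_bits: int, ring_bits: int) -> int:
    """Encode a real number as a fixed-point element of Z_{2^ring_bits}."""
    return round(x * (1 << frac_bits)) % (1 << ring_bits)


def decode(v: int, frac_bits: int, ring_bits: int) -> float:
    """Decode a ring element, reading it as a two's-complement integer."""
    modulus = 1 << ring_bits
    if v >= modulus // 2:
        v -= modulus
    return v / (1 << frac_bits)


def convert(v: int, from_frac: int, from_ring: int,
            to_frac: int, to_ring: int) -> int:
    """Re-encode v under a different ring size and fractional precision."""
    modulus = 1 << from_ring
    signed = v - modulus if v >= modulus // 2 else v
    shift = to_frac - from_frac
    if shift >= 0:
        signed <<= shift       # scale up: append fractional bits
    else:
        signed >>= -shift      # scale down: drop low bits (rounds toward -inf)
    return signed % (1 << to_ring)


# Assumed example widths: 64-bit ring with 18 fractional bits,
# and 32-bit ring with 8 fractional bits.
wide = encode(-1.375, frac_bits=18, ring_bits=64)
narrow = convert(wide, 18, 64, 8, 32)
restored = convert(narrow, 8, 32, 18, 64)
```

Scaling down discards low-order bits, so values smaller than the narrower format's resolution collapse to zero. That loss of precision is one reason a quantization-aware training or distillation step is useful alongside the conversions.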
Peer Reviews
Decision: ICML 2024 Poster
+ MPC-friendly quantization-aware distillation.
+ MPC primitives for scale down and scale up.
+ Comparison with SOTA.
- Distillation is widely used in MPC-based secure inference works.
- The contributions to MPC protocols seem limited.
1. This paper targets an important problem in private inference.
2. The proposed type conversion protocols are creative solutions to a key challenge in quantization-aware secure inference.
3. Extensive evaluations analyze efficiency, utility, scalability, communication cost, and latency across factors such as sequence length and batch size.
1. Lack of comparison to the latest related work.
* The authors present a solution that addresses multiple bottlenecks in secure multi-party computation (MPC) for Transformer models, such as handling non-linear functions and dynamic quantization in an MPC context. They offer modified dyadic quantization and static dyadic quantization as solutions to these issues.
* The paper highlights and addresses the often-overlooked disconnect between the expertise in machine learning and multi-party computation. For examp
* The paper acknowledges that both Ditto and MPCFormer exhibit noticeable utility drops on BERT tasks when employing the ReLU approximation for Softmax. The authors offer the Quad approximation for GeLU to balance utility and efficiency, but this limitation may constrain the applicability of the framework for tasks where such approximations are not tolerable.
* The paper in general is hard to read and requires additional proofreading. I would recommend making the paper easier to read
Taxonomy
Topics: Advanced Memory and Neural Computing · Advanced Data Storage Technologies · Semiconductor materials and devices
Methods: Attention Is All You Need · Weight Decay · Attention Dropout · Dropout · Label Smoothing · Residual Connection · Softmax · WordPiece · Position-Wise Feed-Forward Layer
