PriViT: Vision Transformers for Fast Private Inference
Naren Dhyani, Jianqiao Mo, Minsu Cho, Ameya Joshi, Siddharth Garg, Brandon Reagen, Chinmay Hegde

TL;DR
PriViT introduces a gradient-based method to modify Vision Transformers, making them more suitable for private inference with secure multi-party computation, while preserving accuracy and improving latency-accuracy trade-offs.
Contribution
The paper presents PriViT, a novel algorithm that selectively Taylorizes nonlinearities in ViTs to enhance MPC efficiency without sacrificing prediction accuracy.
Findings
Achieves better latency-accuracy trade-offs compared to existing methods.
Demonstrates effectiveness on standard image classification benchmarks.
Provides publicly available implementation code.
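The "latency-accuracy trade-off" in the findings can be made concrete with a small sketch of how a Pareto frontier is extracted from a set of candidate models. The function name `pareto_frontier` and the model labels and numbers below are hypothetical placeholders for illustration only; they are not results from the paper.

```python
def pareto_frontier(points):
    """Return the (name, latency, accuracy) tuples that no other point dominates.

    A point is dominated when another point has latency that is no higher
    and accuracy that is no lower, and is strictly better on at least one.
    """
    frontier = []
    for name, latency, accuracy in points:
        dominated = any(
            (other_lat <= latency and other_acc >= accuracy)
            and (other_lat < latency or other_acc > accuracy)
            for _, other_lat, other_acc in points
        )
        if not dominated:
            frontier.append((name, latency, accuracy))
    # Sort by latency so the frontier reads left to right on a plot.
    return sorted(frontier, key=lambda p: p[1])


# Purely illustrative values (latency in seconds, accuracy as a fraction).
toy_models = [
    ("A", 10.0, 0.70),
    ("B", 20.0, 0.80),
    ("C", 25.0, 0.75),  # slower and less accurate than B, so dominated
    ("D", 5.0, 0.60),
]
frontier_names = [name for name, _, _ in pareto_frontier(toy_models)]
```

A method improves the trade-off when its models push this frontier toward lower latency at the same accuracy, or higher accuracy at the same latency.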
Abstract
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications. However, ViTs are ill-suited for private inference using secure multi-party computation (MPC) protocols, due to the large number of non-polynomial operations (self-attention, feed-forward rectifiers, layer normalization). We propose PriViT, a gradient-based algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy. Our algorithm is conceptually simple, easy to implement, and improves on existing approaches for designing MPC-friendly transformer architectures by advancing the Pareto frontier of latency versus accuracy. We confirm these improvements via experiments on several standard image classification tasks. Public code is available at…
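To illustrate what "selectively Taylorizing" a nonlinearity could look like, here is a minimal sketch that blends exact GELU with its second-order Taylor polynomial through a per-unit switch value. This is an assumption-laden illustration, not the authors' implementation: the function names, the choice of a second-order expansion around zero, the blending rule, and the 0.5 threshold are all hypothetical, and the training of the switches (the gradient-based part) is omitted.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_taylor2(x: float) -> float:
    """Second-order Taylor expansion of GELU around 0.

    Uses only additions and multiplications, which is the property that
    makes polynomial activations cheaper under MPC.
    """
    return 0.5 * x + x * x / math.sqrt(2.0 * math.pi)


def switched_activation(x: float, c: float) -> float:
    """Blend the two activations with a switch c in [0, 1] (assumed form).

    c = 1 keeps the exact GELU; c = 0 uses the polynomial only.
    """
    return c * gelu(x) + (1.0 - c) * gelu_taylor2(x)


def binarize(switches, threshold=0.5):
    """Round relaxed switch values to hard 0/1 decisions (assumed threshold)."""
    return [1 if c >= threshold else 0 for c in switches]


def count_nonlinear(switches, threshold=0.5):
    """Number of units that keep the exact, MPC-expensive nonlinearity."""
    return sum(binarize(switches, threshold))
```

In a scheme of this kind, the switch values would be learned jointly with the network weights under a penalty that pushes them toward zero, so that only the nonlinearities that matter most for accuracy are retained.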
Peer Reviews
Decision: Submitted to ICLR 2024
Using a trainable switching mechanism for GELU and softmax, the authors find a different optimal configuration for each layer.
Compared with the prior work, MPCViT, it shows better results on TinyImageNet but worse latency on CIFAR-100. In terms of accuracy, it is superior in every case. The paper states that DELPHI is its focus: "In this paper, our focus is exclusively on the DELPHI protocol (Mishra et al., 2020a) for private inference. We choose DELPHI as a matter of convenience;" However, the reported results do not include any performance comparison with DELPHI.
1. The method analysis is clear and the latency breakdown is helpful. 2. The experiments on several image classification benchmarks are solid and comprehensive. 3. The proposed method is faster than the previous SOTA model and achieves competitive performance.
1. More detail is needed on the knowledge distillation part. 2. More discussion of the nonlinearity distribution is needed.
* The paper is well-motivated, as the deployment of ViTs in private scenarios is becoming increasingly important and current approaches are not tailored to the Transformer architecture. * The proposed method is simple yet effective compared with SOTA approaches.
[Major] 1. **Experiments:** The authors have conducted sufficient comparative experiments on 3 datasets (CIFAR-10/100 and TinyImageNet). However, the image resolutions are no more than $64\times 64$, which is rather small compared with commonly used datasets like ImageNet-1k and Caltech-101/256. It would be interesting to see ImageNet results and a comparison with SENet if possible. 2. **Experiments:** The authors' exclusive use of ViT-Tiny for comparison is insufficient to establish the me
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization
