SelectFormer: Private and Practical Data Selection for Transformers
Xu Ouyang, Felix Xiaozhu Lin, Yangfeng Ji

TL;DR
This paper introduces a practical method for private data selection for Transformer models using MPC, significantly reducing evaluation time while maintaining high accuracy in data quality assessment.
Contribution
It proposes a novel pipeline, low-dimensional emulation of nonlinear operators, and parallel MPC scheduling to enable efficient private data selection for Transformers.
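To make the idea of low-dimensional emulation concrete, here is a minimal sketch, not the authors' implementation. It assumes the nonlinear operator is a softmax over an 8-dimensional vector and fits a small MLP with a 3-unit hidden layer to it on a small random sample. The class `TinyMLP`, the dimensions `d` and `h`, the tanh activation, and the training settings are all illustrative assumptions.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax: the operator being emulated in this toy.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class TinyMLP:
    """Hypothetical d -> h -> d MLP; h < d is the low-dimensional bottleneck."""

    def __init__(self, d, h, rng):
        self.d, self.h = d, h
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(h)]
        self.b1 = [0.0] * h
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(d)]
        self.b2 = [0.0] * d

    def forward(self, x):
        hid = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
               for row, b in zip(self.w1, self.b1)]
        out = [sum(w * hj for w, hj in zip(row, hid)) + b
               for row, b in zip(self.w2, self.b2)]
        return hid, out

    def sgd_step(self, x, target, lr):
        hid, out = self.forward(x)
        # Gradient of 0.5 * squared error with respect to the outputs.
        err = [o - t for o, t in zip(out, target)]
        # Backpropagate to the hidden layer before w2 is modified.
        grad_hid = [sum(err[i] * self.w2[i][j] for i in range(self.d)) * (1 - hid[j] ** 2)
                    for j in range(self.h)]
        for i in range(self.d):
            for j in range(self.h):
                self.w2[i][j] -= lr * err[i] * hid[j]
            self.b2[i] -= lr * err[i]
        for j in range(self.h):
            for k in range(self.d):
                self.w1[j][k] -= lr * grad_hid[j] * x[k]
            self.b1[j] -= lr * grad_hid[j]

def mse(model, xs):
    # Mean squared error between the MLP output and the exact softmax.
    total = 0.0
    for x in xs:
        _, out = model.forward(x)
        total += sum((o - t) ** 2 for o, t in zip(out, softmax(x)))
    return total / (len(xs) * model.d)

rng = random.Random(0)
d, h = 8, 3  # assumed sizes, chosen only for illustration
sample = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(200)]
mlp = TinyMLP(d, h, rng)
loss_before = mse(mlp, sample)
for _ in range(50):
    for x in sample:
        mlp.sgd_step(x, softmax(x), lr=0.05)
loss_after = mse(mlp, sample)
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The point of such a substitution is that the MLP uses only matrix products and one cheap activation, which are typically less costly to evaluate over MPC than an exact softmax. The sketch runs in plaintext and says nothing about the actual MPC cost.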
Findings
Reduces MPC evaluation time from thousands of hours to tens of hours.
Maintains around 99.8% accuracy compared to direct evaluation.
Effective across diverse Transformer models and benchmarks.
Abstract
Critical to a free data market is private data selection, i.e. the model owner selects and then appraises training data from the data owner before both parties commit to a transaction. To keep the data and model private, this process shall evaluate the target model to be trained over Multi-Party Computation (MPC). While prior work suggests that evaluating Transformer-based models over MPC is prohibitively expensive, this paper makes it practical for the purpose of data selection. Our contributions are three: (1) a new pipeline for private data selection over MPC; (2) emulating high-dimensional nonlinear operators with low-dimension MLPs, which are trained on a small sample of the data of interest; (3) scheduling MPC in a parallel, multiphase fashion. We evaluate our method on diverse Transformer models and NLP/CV benchmarks. Compared to directly evaluating the target model…
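The multiphase idea in the abstract can be illustrated with a small plaintext sketch. It assumes each phase scores the surviving candidates with a progressively more accurate (and more expensive) proxy and keeps only the top fraction. The function `multiphase_select`, the scoring functions, and the keep fractions are hypothetical. In the paper's setting each scoring pass would run over MPC, which this sketch does not model.

```python
import math

def multiphase_select(candidates, proxies, keep_fractions, budget):
    """Filter candidates through successive scoring phases.

    proxies: scoring functions ordered from cheapest to most accurate
             (hypothetical stand-ins for proxy models evaluated over MPC).
    keep_fractions: fraction of survivors kept after each phase.
    budget: number of data points the model owner finally selects.
    """
    survivors = list(candidates)
    for score, frac in zip(proxies, keep_fractions):
        ranked = sorted(survivors, key=score, reverse=True)
        # Never drop below the final budget in an intermediate phase.
        keep = max(budget, math.ceil(len(ranked) * frac))
        survivors = ranked[:keep]
    # Final ranking uses the most accurate proxy.
    return sorted(survivors, key=proxies[-1], reverse=True)[:budget]

# Toy data: (id, hidden quality value). Real scores would come from model outputs.
candidates = [(i, i / 20) for i in range(20)]

def coarse_score(c):
    # Cheap proxy: quality quantized into four buckets.
    return math.floor(c[1] * 4) / 4

def exact_score(c):
    # Expensive proxy: the exact quality value.
    return c[1]

selected = multiphase_select(
    candidates,
    proxies=[coarse_score, exact_score],
    keep_fractions=[0.5, 0.5],
    budget=3,
)
print([c[0] for c in selected])
```

The design choice shown here is that the expensive scorer only sees the half of the pool that survives the cheap phase, so most of the evaluation cost is spent on promising candidates.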
Peer Reviews
Decision: Submitted to ICLR 2024
Efficiently purchasing data for training artificial intelligence systems within a fixed budget is a novel research direction. It appears to be a necessary research topic not only for AI security but also for a range of AI training scenarios. The paper makes a persuasive case for both the research direction and the specific topic.
The paper seems to lack a clear and precise explanation of its technical aspects. The technical details of the research method are quite ambiguous, making it difficult to understand the core ideas of the paper. While the paper mentions proposing data selection and appraisal methods for training artificial intelligence models using MPC, it does not precisely explain how these techniques relate to MPC protocols and security. Additionally, it doesn't provide a clear explanati
The performance of the Softmax function appears to be important in Transformers when MPC or FHE is considered. The precision of the Softmax approximation has a very large impact on overall inference performance. This paper proposes using an MLP in its place.
The core ideas proposed in the paper are described on pages 4 and 5, but the description there is somewhat unclear. In particular, it is unclear whether data transfer between proxy models in the multi-phase selection section takes place over MPC. This part needs to be restated more clearly.
+ An MPC-based private data selection framework for large Transformer models.
+ Nonlinearity evaluation with low-dimensional MLPs.
+ Multi-phase selection.
+ Parallel MPC executions.
- This work seems to simply combine the techniques of data selection and secure inference on LLMs.
- Replacing high-dimensional nonlinearity with low-dimensional MLPs seems less general.
- Batch evaluation is a widely used method in PPML and lacks novelty.
Taxonomy
Topics: Cryptography and Data Security · Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques
Methods: Multi-Head Attention · Dense Connections · Linear Layer · Label Smoothing · Absolute Position Encodings · Attention Is All You Need · Adam · Residual Connection · Layer Normalization · Softmax
