SelectFormer: Private and Practical Data Selection for Transformers

Xu Ouyang; Felix Xiaozhu Lin; Yangfeng Ji

arXiv:2310.02373 · cs.LG · March 4, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a practical method for private data selection for Transformer models using MPC, significantly reducing evaluation time while maintaining high accuracy in data quality assessment.

Contribution

It proposes a novel pipeline, low-dimensional emulation of nonlinear operators, and parallel MPC scheduling to enable efficient private data selection for Transformers.
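The multiphase idea can be illustrated with a minimal sketch in plain Python. It assumes that each phase scores the surviving candidates with some proxy and keeps only the top-ranked ones for the next, more expensive phase. The function name `multiphase_select`, the scorers, and the keep counts are all hypothetical; the summary above does not specify the paper's actual ranking rule, and in the real setting each scoring step would run over MPC rather than in the clear.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def multiphase_select(
    candidates: Sequence[T],
    scorers: Sequence[Callable[[T], float]],
    keep_counts: Sequence[int],
) -> List[T]:
    """Filter candidates through successive scoring phases.

    Assumption (not from the source): phase i ranks the survivors with
    scorers[i] and keeps the keep_counts[i] highest-scoring items.
    """
    if len(scorers) != len(keep_counts):
        raise ValueError("need exactly one keep count per scorer")

    survivors = list(candidates)
    for score, keep in zip(scorers, keep_counts):
        # sorted() is stable, so ties keep their previous relative order.
        ranked = sorted(survivors, key=score, reverse=True)
        survivors = ranked[:keep]
    return survivors


# Toy usage: a coarse first phase, then a finer second phase.
toy_candidates = list(range(10))
toy_scorers = [lambda x: x % 5, lambda x: x]
selected = multiphase_select(toy_candidates, toy_scorers, keep_counts=[4, 2])
```

The point of the design is that early phases see every candidate but can use a cheap scorer, while later phases use a costlier scorer on far fewer candidates.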

Findings

01 Reduces MPC evaluation time from thousands to tens of hours.

02 Maintains around 99.8% accuracy compared to direct evaluation.

03 Effective across diverse Transformer models and benchmarks.

Abstract

Critical to a free data market is private data selection, i.e., the model owner selects and then appraises training data from the data owner before both parties commit to a transaction. To keep the data and model private, this process shall evaluate the target model to be trained over Multi-Party Computation (MPC). While prior work suggests that evaluating Transformer-based models over MPC is prohibitively expensive, this paper makes it practical for the purpose of data selection. Our contributions are three: (1) a new pipeline for private data selection over MPC; (2) emulating high-dimensional nonlinear operators with low-dimension MLPs, which are trained on a small sample of the data of interest; (3) scheduling MPC in a parallel, multiphase fashion. We evaluate our method on diverse Transformer models and NLP/CV benchmarks. Compared to directly evaluating the target model…
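Contribution (2) can be pictured with a toy sketch: fit a small MLP to reproduce a nonlinear operator on sampled inputs, so that evaluation needs only multiply-adds and ReLU comparisons. Everything here is an illustrative assumption rather than the paper's setup: softmax as the emulated operator (Reviewer 02 names it), the 4-to-8-to-4 layer sizes, uniform random inputs standing in for "a small sample of the data of interest", the learning rate, and the class name `TinyMLP`. The code runs in the clear and involves no MPC.

```python
import math
import random


def softmax(xs):
    """Reference operator that the MLP is trained to imitate."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class TinyMLP:
    """One hidden ReLU layer mapping dim -> hidden -> dim (hypothetical sizes)."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]
        self.b2 = [0.0] * dim

    def forward(self, x):
        pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(self.w1, self.b1)]
        h = [max(0.0, p) for p in pre]
        y = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(self.w2, self.b2)]
        return h, y

    def sgd_step(self, x, target, lr):
        """One SGD step on the loss 0.5 * ||y - target||^2."""
        h, y = self.forward(x)
        grad_y = [yi - ti for yi, ti in zip(y, target)]
        # Backpropagate through the output layer before its weights change.
        grad_h = [
            sum(grad_y[k] * self.w2[k][j] for k in range(len(y))) if h[j] > 0 else 0.0
            for j in range(len(h))
        ]
        for k in range(len(y)):
            for j in range(len(h)):
                self.w2[k][j] -= lr * grad_y[k] * h[j]
            self.b2[k] -= lr * grad_y[k]
        for j in range(len(h)):
            for i in range(len(x)):
                self.w1[j][i] -= lr * grad_h[j] * x[i]
            self.b1[j] -= lr * grad_h[j]


def mean_squared_error(model, samples):
    total = 0.0
    for x in samples:
        _, y = model.forward(x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, softmax(x)))
    return total / len(samples)


rng = random.Random(0)
# Stand-in for a small sample of the data of interest (assumption).
samples = [[rng.uniform(-2.0, 2.0) for _ in range(4)] for _ in range(200)]
model = TinyMLP(dim=4, hidden=8, rng=rng)

loss_before = mean_squared_error(model, samples)
for _ in range(50):
    for x in samples:
        model.sgd_step(x, softmax(x), lr=0.01)
loss_after = mean_squared_error(model, samples)
```

After training, `model.forward` uses no exponentials or divisions, which are typically the costly parts of evaluating softmax under secret sharing. Whether the fitted approximation is good enough for ranking data is an empirical question that this toy does not answer.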

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3: reject, not good enough · Confidence 3

Strengths

Efficiently purchasing data for training artificial intelligence systems within a fixed budget is a novel research direction. It appears to be a necessary research topic not only for AI security but also for a variety of AI training scenarios. The paper makes a persuasive case for both the research direction and the specific topic.

Weaknesses

It seems to lack a clear and precise explanation of the technical aspects of the paper. All the technical details regarding the research method are quite ambiguous, making it difficult to understand the core ideas of the paper. While the paper mentions proposing data selection and appraisal methods when training artificial intelligence models using MPC, it does not precisely explain how these techniques are related to MPC protocols and security. Additionally, it doesn't provide a clear explanati

Reviewer 02 · Rating 6: marginally above the acceptance threshold · Confidence 3

Strengths

The performance of the Softmax function appears to be important in Transformers when MPC or FHE is considered. The precision of the Softmax approximation has a very large impact on overall inference performance. In this paper, the authors propose using an MLP in place of Softmax.

Weaknesses

The core ideas proposed in the paper are described on pages 4 and 5, but the description there is somewhat unclear. In particular, the multi-phase selection section does not make clear whether data transfer between proxy models occurs over MPC. This part needs to be restated more clearly.

Reviewer 03 · Rating 5: marginally below the acceptance threshold · Confidence 4

Strengths

+ An MPC-based private data selection framework for large Transformer models.
+ Nonlinearity evaluation with low-dimensional MLPs.
+ Multi-phase selection.
+ Parallel MPC executions.

Weaknesses

- This work seems to simply combine the techniques of data selection and secure inference on LLMs.
- Replacing high-dimensional nonlinearity with low-dimensional MLPs seems less general.
- Batch evaluation is a widely used method in PPML and lacks novelty.

Taxonomy

Topics: Cryptography and Data Security · Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques

Methods: Multi-Head Attention · Dense Connections · Linear Layer · Label Smoothing · Absolute Position Encodings · Attention Is All You Need · Adam · Residual Connection · Layer Normalization · Softmax