Privacy-preserving Distributed Machine Learning via Local Randomization and ADMM Perturbation

Xin Wang; Hideaki Ishii; Linkang Du; Peng Cheng; Jiming Chen

arXiv:1908.01059·cs.LG·August 26, 2020

TL;DR

This paper introduces a privacy-preserving distributed machine learning framework using ADMM with local randomization and noise perturbation, enabling heterogeneous privacy guarantees without trusting the server and minimizing privacy loss over iterations.

Contribution

It proposes a novel ADMM-based DML framework that does not assume trusted servers and offers heterogeneous privacy levels based on data sensitivity and trust degrees.

Findings

1. The framework effectively balances privacy and model accuracy.

2. Experimental results validate the theoretical privacy guarantees.

3. The approach reduces privacy loss over multiple ADMM iterations.

Tables

Table I: Classification accuracy with test data (%)

Dataset	Without privacy protection	Modified loss ($\epsilon = 0.4$, $R = 0$)	Modified loss ($\epsilon = 1$, $R = 0$)	Perturbed ADMM ($\epsilon = 0.4$, $R = 1$)	Perturbed ADMM ($\epsilon = 0.4$, $R = 9$)	Perturbed ADMM ($\epsilon = 1$, $R = 1$)	Perturbed ADMM ($\epsilon = 1$, $R = 9$)
German	75.00	71.00	74.00	69.67	64.00	74.33	67.67
Image	75.56	70.13	72.84	69.33	63.10	70.45	65.50
Ringnorm	77.38	73.44	76.82	73.74	66.18	75.77	70.23
Banana	58.22	54.33	56.06	54.28	43.11	55.89	54.44
Splice	56.60	46.84	56.60	54.94	46.39	55.83	52.50
Twonorm	97.90	96.59	97.38	96.51	92.28	97.41	94.77
Waveform	88.93	84.60	87.93	84.07	80.47	87.67	81.73

Equations

$$J(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}}):=\sum_{i=1}^{n}\left[\sum_{j=1}^{m_{i}}\frac{1}{m_{i}}\ell(y_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j})+\frac{a}{n}N(\mathbf{w}_{i})\right],$$

$$|\ell(\cdot)|\leq c_{1},\quad\left\|\frac{\partial\ell(\cdot)}{\partial\mathbf{w}}\right\|\leq c_{2},\quad\left\|\frac{\partial\ell^{2}(\cdot)}{\partial\mathbf{w}^{2}}\right\|\leq c_{3},$$

$$N(\mathbf{w}_{1})-N(\mathbf{w}_{2})\geq\nabla N(\mathbf{w}_{1})^{\mathrm{T}}(\mathbf{w}_{2}-\mathbf{w}_{1})+\frac{\kappa}{2}\|\mathbf{w}_{2}-\mathbf{w}_{1}\|_{2}^{2},$$

$$J_{i}(\mathbf{w}_{i}):=\sum_{j=1}^{m_{i}}\frac{1}{m_{i}}\ell(y_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j})+\frac{a}{n}N(\mathbf{w}_{i}).$$

$$\mathbf{w}_{i}=\mathbf{z}_{il},\quad\mathbf{w}_{l}=\mathbf{z}_{il},\quad\forall(s_{i},s_{l})\in\mathcal{E},$$

$$\min_{\{\mathbf{w}_{i}\},\{\mathbf{z}_{il}\}}\ J(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})\quad\text{s.t.}\quad\mathbf{w}_{i}=\mathbf{z}_{il},\ \mathbf{w}_{l}=\mathbf{z}_{il},\ \forall(s_{i},s_{l})\in\mathcal{E}.$$

$$\min_{\mathbf{w},\mathbf{z}}\ J(\mathbf{w})\quad\text{s.t.}\quad A\mathbf{w}+B\mathbf{z}=0.$$

$$L_{+}:=\frac{1}{2}(A_{1}+A_{2})^{\mathrm{T}}(A_{1}+A_{2}),\quad L_{-}:=\frac{1}{2}(A_{1}-A_{2})^{\mathrm{T}}(A_{1}-A_{2}).$$

$$\nabla J(\mathbf{w}(t+1))+\boldsymbol{\gamma}(t)+\beta(L_{+}+L_{-})\mathbf{w}(t+1)-\beta L_{+}\mathbf{w}(t)=0,$$

$$\boldsymbol{\gamma}(t+1)-\boldsymbol{\gamma}(t)-\beta L_{-}\mathbf{w}(t+1)=0.$$

$$\mathcal{L}_{i}(\mathbf{w}_{i},\{\mathbf{w}_{l}(t)\}_{l\in\mathcal{N}_{i}\bigcup\{i\}},\boldsymbol{\gamma}_{i}(t)):=J_{i}(\mathbf{w}_{i})+\boldsymbol{\gamma}_{i}^{\mathrm{T}}(t)\mathbf{w}_{i}+\beta\sum_{l\in\mathcal{N}_{i}}\left\|\mathbf{w}_{i}-\frac{1}{2}(\mathbf{w}_{i}(t)+\mathbf{w}_{l}(t))\right\|_{2}^{2}.$$

$$\mathbf{w}_{i}(t+1)=\arg\min_{\mathbf{w}_{i}}\mathcal{L}_{i}(\mathbf{w}_{i},\{\mathbf{w}_{l}(t)\}_{l\in\mathcal{N}_{i}\bigcup\{i\}},\boldsymbol{\gamma}_{i}(t)),$$

$$\boldsymbol{\gamma}_{i}(t+1)=\boldsymbol{\gamma}_{i}(t)+\beta\sum_{l\in\mathcal{N}_{i}}(\mathbf{w}_{i}(t+1)-\mathbf{w}_{l}(t+1)).$$

$$\Pr[M(\mathbf{d}_{1})\in\mathcal{O}]\leq e^{\epsilon}\Pr[M(\mathbf{d}_{2})\in\mathcal{O}].$$

$$y_{i,j}^{\prime}=\begin{cases}1,&\text{with probability }p\\-1,&\text{with probability }p\\y_{i,j},&\text{with probability }1-2p.\end{cases}$$

$$p=\frac{1}{1+e^{\epsilon}},$$

$$J_{i}(\mathbf{w}_{i}):=J_{i}(\mathbf{w}_{i})+\frac{1}{n}\boldsymbol{\eta}_{i}^{\mathrm{T}}\mathbf{w}_{i},$$

$$\tilde{\mathcal{L}}_{i}(\mathbf{w}_{i},\{\mathbf{w}_{l}(t)\}_{l\in\mathcal{N}_{i}\bigcup\{i\}},\boldsymbol{\gamma}_{i}(t)):=J_{i}(\mathbf{w}_{i})+\boldsymbol{\gamma}_{i}^{\mathrm{T}}(t)\mathbf{w}_{i}+\beta\sum_{l\in\mathcal{N}_{i}}\left\|\mathbf{w}_{i}-\frac{1}{2}(\mathbf{w}_{i}(t)+\mathbf{w}_{l}(t))\right\|_{2}^{2}.$$

$$\mathbf{w}_{i}(t+1)=\ldots,\qquad\boldsymbol{\gamma}_{i}(t+1)=\ldots$$

$$J_{\mathcal{P}}(\mathbf{w}):=\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{P}}[\ell(y,\mathbf{w}^{\mathrm{T}}\mathbf{x})]+\frac{a}{n}N(\mathbf{w}).$$

$$\hat{\ell}(y_{i,j}^{\prime},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon):=\frac{e^{\epsilon}\ell(y_{i,j}^{\prime},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j})-\ell(-y_{i,j}^{\prime},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j})}{e^{\epsilon}-1}.$$

$$\mathbb{E}_{y_{i,j}^{\prime}}[\hat{\ell}(y_{i,j}^{\prime},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon)]=\ell(y_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j}).$$

$$\hat{c}_{2}:=\frac{e^{\epsilon}+1}{e^{\epsilon}-1}c_{2},$$

$$J_{i}(\mathbf{w}_{i}):=\sum_{j=1}^{m_{i}}\frac{1}{m_{i}}\hat{\ell}(y_{i,j}^{\prime},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon)+\frac{a}{n}N(\mathbf{w}_{i}).$$

$$\min_{\{\mathbf{w}_{i}\}}\ \ldots\quad\text{s.t.}\quad\ldots$$

$$J(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})=\sum_{i=1}^{n}\left[J_{i}(\mathbf{w}_{i})+\frac{1}{n}\boldsymbol{\eta}_{i}^{\mathrm{T}}\mathbf{w}_{i}\right].$$

$$\min_{\{\mathbf{w}_{i}\}}\ \ldots\quad\text{s.t.}\quad\ldots$$

$$\mathbf{w}_{\mathrm{opt}}=\mathbf{w}_{i}=\mathbf{w}_{l},\quad\forall i,l.$$
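The identity in the equation list stating that the modified loss is unbiased under label randomization can be checked numerically. The sketch below is only an illustration: it assumes the logistic loss as a concrete choice of $\ell$ (the list does not fix one), and the function names are hypothetical. It uses the fact that the randomization returns the true label with probability $1-p$ and the flipped label with probability $p=1/(1+e^{\epsilon})$.

```python
import numpy as np

def logistic_loss(y, score):
    # Assumed example loss: log(1 + exp(-y * score)), computed stably.
    return np.logaddexp(0.0, -y * score)

def modified_loss(y_noisy, score, eps):
    # Modified loss from the equation list, evaluated on a randomized label.
    e = np.exp(eps)
    return (e * logistic_loss(y_noisy, score)
            - logistic_loss(-y_noisy, score)) / (e - 1.0)

def expected_modified_loss(y, score, eps):
    # Exact expectation over the randomized label:
    # the true label is reported w.p. 1 - p, the flipped label w.p. p.
    p = 1.0 / (1.0 + np.exp(eps))
    return ((1.0 - p) * modified_loss(y, score, eps)
            + p * modified_loss(-y, score, eps))

true_value = logistic_loss(1.0, 0.7)
expected_value = expected_modified_loss(1.0, 0.7, eps=0.4)
```

Under these assumptions `expected_value` agrees with `true_value` up to floating-point error, for any label, score, and $\epsilon>0$.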

Taxonomy

Methods: Alternating Direction Method of Multipliers

Full text

Privacy-preserving Distributed Machine Learning via Local Randomization and ADMM Perturbation

Xin Wang, Hideaki Ishii, Linkang Du, Peng Cheng, and Jiming Chen

X. Wang, L. Du, P. Cheng and J. Chen are with the State Key Lab. of Industrial Control Technology, Zhejiang University, Hangzhou, 310027, P. R. China. X. Wang is also with the Dept. of Computer Science, Tokyo Institute of Technology, Yokohama, 226-8502, Japan. H. Ishii is with the Dept. of Computer Science, Tokyo Institute of Technology, Yokohama, 226-8502, Japan.

Abstract

With the proliferation of training data, distributed machine learning (DML) is becoming more competent for large-scale learning tasks. However, privacy concerns have to be given priority in DML, since training data may contain sensitive information of users. In this paper, we propose a privacy-preserving ADMM-based DML framework with two novel features: First, we remove the assumption commonly made in the literature that the users trust the server collecting their data. Second, the framework provides heterogeneous privacy for users depending on data’s sensitive levels and servers’ trust degrees. The challenging issue is to keep the accumulation of privacy losses over ADMM iterations minimal. In the proposed framework, a local randomization approach, which is differentially private, is adopted to provide users with self-controlled privacy guarantee for the most sensitive information. Further, the ADMM algorithm is perturbed through a combined noise-adding method, which simultaneously preserves privacy for users’ less sensitive information and strengthens the privacy protection of the most sensitive information. We provide detailed analyses on the performance of the trained model according to its generalization error. Finally, we conduct extensive experiments using real-world datasets to validate the theoretical results and evaluate the classification performance of the proposed framework.

Index Terms:

Distributed machine learning, privacy preservation, ADMM, generalization error.

I Introduction

In the era of big data, distributed machine learning (DML) is increasingly applied in various areas of our daily lives, especially with the proliferation of training data. Typical applications of DML include machine-aided prescription [1], natural language processing [2], and recommender systems [3], to name a few. Compared with the traditional single-machine model, DML is more competent for large-scale learning tasks due to its scalability and robustness to faults. The alternating direction method of multipliers (ADMM), a commonly used parallel computing approach in the optimization community, is a simple but efficient algorithm for multiple servers to collaboratively solve learning problems [4]. Our DML framework also uses ADMM as the underlying algorithm.

However, privacy is a significant issue that has to be considered in DML. In many machine learning tasks, users’ data for training the prediction model contains sensitive information, such as genotypes, salaries, and political orientations. For example, if we adopt DML methods to predict HIV-1 infection [5], the data used for protein-protein interactions identification mainly includes patients’ information about their proteins, labels indicating whether they are HIV-1 infected or not, and other kinds of health data. Such information, especially the labels, is extremely sensitive for the patients. Moreover, there exist potential risks of privacy disclosure. On one hand, when users report their data to servers, illegal parties can eavesdrop on the data transmission processes or penetrate the servers to steal reported data. On the other hand, the information communicated between servers, which is required to train a common prediction model, can also disclose users’ private data. If these disclosure risks are not properly controlled, users would refuse to contribute their data to servers even though DML may bring convenience for them.

Various privacy-preserving solutions have been proposed in the literature. Differential privacy (DP) [6] is one of the standard non-cryptographic approaches and has been applied in distributed computing scenarios [7, 8, 9, 10]. Other schemes which are not DP-preserving can be found in [11, 12, 13]. In addition, privacy-aware machine learning problems [14, 15, 16, 17] have attracted considerable attention, and many researchers have proposed ADMM-based solutions [18, 19, 20]. However, there exists an underlying assumption in most privacy-aware schemes that the data contributors trust the servers collecting their data. This trust assumption may lead to privacy disclosure in many cases. For instance, when the server is penetrated by an adversary, the information obtained by the adversary may be the users’ original private data.

Moreover, most existing schemes provide the same privacy guarantee for the entire data sample of a user though different data pieces are likely to have distinct sensitive levels. In the example of HIV-1 infection prediction [5] mentioned above, it is obvious that the label indicating HIV-1 infected or uninfected is more sensitive than other health data. Thus, the data pieces with higher sensitive levels should obtain stronger protection. On the other hand, as claimed in [7], different servers present diverse trust degrees to users due to the distinct permissions to users’ data. The servers having no direct connection with a user, compared with the server collecting his/her data, may be less trustworthy. Here, the user would require that the less trustworthy servers obtain his/her information under stronger privacy preservation. Therefore, we will investigate a privacy-aware DML framework that preserves heterogeneous privacy, where users’ data pieces with distinct sensitive levels can obtain different privacy guarantee against servers of diverse trust degrees.

One challenging issue is to reduce the accumulation of privacy losses over ADMM iterations as much as possible, especially for the privacy guarantee of the most sensitive data pieces. Most existing ADMM-based private DML frameworks preserve privacy by perturbing the intermediate results shared by servers. Since each intermediate result is computed with users’ original data, its release will disclose part of the private information, implying that the privacy loss may increase as iterations proceed. Moreover, these private DML frameworks only provide the same privacy guarantee for all data pieces. In addition to intermediate information perturbation, original data randomization methods can be combined to provide heterogeneous privacy protection. However, such an approach introduces coupled uncertainties into the classification model. The lack of uncertainty decoupling methods makes performance quantification a challenging task.

In this paper, we propose a privacy-preserving distributed machine learning (PDML) framework to settle these challenging issues. After removing the trustworthy servers assumption, we incorporate the users’ data reporting into the DML process, which forms a two-phase training scheme together with the distributed computing process. For privacy preservation, we adopt different approaches in the two phases. In Phase 1, a user first leverages a local randomization approach to obfuscate the most sensitive data pieces and sends the randomized version to a server. This technique provides the user with self-controlled privacy guarantee for the most sensitive information. Further, in Phase 2, multiple servers collaboratively train a common prediction model and there, they use a combined noise-adding method to perturb the communicated messages, which preserves privacy for users’ less sensitive data pieces. Also, such perturbation strengthens the privacy preservation of data pieces with the highest sensitive level. For the performance of the PDML framework, we analyze the generalization error of current classifiers trained by different servers.

The main contributions of this paper are threefold:

1. A two-phase PDML framework is proposed to provide heterogeneous privacy protection in DML, where users’ data pieces obtain different privacy guarantees depending on their sensitive levels and servers’ trust degrees.

2. In Phase 1, we design a local randomization approach, which preserves DP for the users’ most sensitive information. In Phase 2, a combined noise-adding method is devised to compensate the privacy protection of other data pieces.

3. The convergence property of the proposed ADMM-based privacy-aware algorithm is analyzed. We also give a theoretical bound of the difference between the generalization error of trained classifiers and the ideal optimal classifier.

The remainder of this paper is organized as follows. Related works are discussed in Section II. We provide some preliminaries and formulate the problem in Section III. Section IV presents the designed privacy-preserving framework, and the performance is analyzed in Section V. In order to validate the classification performance, we use multiple real-world datasets and conduct experiments in Section VI. Finally, Section VII concludes the paper. A preliminary version [21] of this paper was accepted for presentation at IEEE CDC 2019. This paper contains a different privacy-preserving approach with a fully distributed ADMM setting, full proofs of the main results, and more experimental results.

II Related Works

As one of the important applications of distributed optimization, DML has received widespread attention from researchers. Besides ADMM schemes, many distributed approaches have been proposed in the literature, e.g., subgradient descent methods [22], local message-passing algorithms [23], adaptive diffusion mechanisms [24], and dual averaging approaches [25]. Compared with these approaches, ADMM schemes achieve faster empirical convergence [26], making them more suitable for large-scale DML tasks.

For privacy-preserving problems, cryptographic techniques [27, 28, 29] are often used to protect information from being inferred when the key is unknown. In particular, homomorphic encryption methods [28], [29] allow untrustworthy servers to calculate with encrypted data, and this approach has been applied in an ADMM scheme [20]. Nevertheless, such schemes unavoidably bring extra computation and communication overheads. Another commonly used approach to preserve privacy is random value perturbation [6], [30], [31]. DP has been increasingly acknowledged as the de facto criterion for non-encryption-based data privacy. This approach incurs lower cost but still provides a strong privacy guarantee, though there exist tradeoffs between privacy and performance [7].

In recent years, random value perturbation-based approaches have been widely used to address privacy protection in distributed computing, especially in consensus problems [32]. For instance, [7, 8, 9], [11, 12, 13] provide privacy-preserving average consensus paradigms, where the mechanisms in [7, 8, 9] provide DP guarantee. Moreover, for a maximum consensus algorithm, [10] gives a differentially private mechanism. Since these solutions mainly focus on simple statistical analysis (e.g., computation of average and maximum elements), there may exist difficulties in directly applying them to DML.

Privacy-preserving machine learning problems have also attracted a lot of attention recently. Under centralized scenarios, Chaudhuri et al. [14] proposed a DP solution for an empirical risk minimization problem by perturbing the objective function with well-designed noise. For privacy-aware DML, Han et al. [33] also gave a differentially private mechanism, where the underlying distributed approach is subgradient descent. The works [15] and [16] present dynamic DP schemes for ADMM-based DML, where privacy guarantee is provided in each iteration. However, if a privacy violator uses the published information in all iterations to make inference, there will be no privacy guarantee. In addition, an obfuscated stochastic gradient method via correlated perturbations was proposed in [17], though it cannot provide DP preservation. Different from these works, in this paper we remove the trustworthy servers assumption. Moreover, we take into consideration the distinct sensitive levels of data pieces and the diverse trust degrees of servers, and propose the PDML framework providing heterogeneous privacy preservation.

III Preliminaries and Problem Statement

In this section, we introduce the overall computation framework of DML and the ADMM algorithm used there. Moreover, the privacy-preserving problem for the framework is formulated with the definition of local differential privacy.

III-A System Setting

We consider a collaborative DML framework to carry out classification problems based on data collected from a large number of users. Fig. 1 gives a schematic diagram. There are two parties involved: users (or data contributors) and computing servers. The goal of DML is to train a classification model based on the data of all users. It has two phases, data collection and distributed computation, called Phase 1 and Phase 2, respectively. In Phase 1, each user sends his/her data to the server that is responsible for collecting all the data from the user’s group. In Phase 2, each computing server utilizes a distributed computing approach to cooperatively train the classifier through information interaction with other servers. The proposed DML framework is based on the one in [7], but the learning tasks are much more complex than the basic statistical analysis considered by [7].

Network Model. Consider $n\geq 2$ computing servers participating in the framework where the $i$ th server is denoted by $s_{i}$ . We use an undirected and connected graph $\mathcal{G}=(\mathcal{S},\mathcal{E})$ to describe the underlying communication topology, where $\mathcal{S}=\{s_{i}\;|\;i=1,2,\ldots,n\}$ is the servers set and $\mathcal{E}\subseteq\mathcal{S}\times\mathcal{S}$ is the set of communication links between servers. The number of communication links in $\mathcal{G}$ is denoted by $E$ , i.e., $E=|\mathcal{E}|$ . Let the set of neighbor servers of $s_{i}$ be $\mathcal{N}_{i}=\{s_{l}\in\mathcal{S}\;|\;(s_{i},s_{l})\in\mathcal{E}\}$ . The degree of server $s_{i}$ is denoted by $N_{i}=|\mathcal{N}_{i}|$ .

Different servers collect data from different groups of users, and thus all users can be divided into $n$ distinct groups. The $i$ th group of users, whose data is collected by server $s_{i}$ , is denoted by the set $\mathcal{U}_{i}$ , and $m_{i}=|\mathcal{U}_{i}|$ is the number of users in $\mathcal{U}_{i}$ . Each user $j\in\mathcal{U}_{i}$ has a data sample $\mathbf{d}_{i,j}=(\mathbf{x}_{i,j},y_{i,j})\in\mathcal{X}\times\mathcal{Y}\subseteq\mathbb{R}^{d+1}$ , which is composed of a feature vector $\mathbf{x}_{i,j}\in\mathcal{X}\subseteq\mathbb{R}^{d}$ and the corresponding label $y_{i,j}\in\mathcal{Y}\subseteq\mathbb{R}$ . In this paper, we consider a binary-classification problem. That is, there are two types of labels as $y_{i,j}\in\{-1,1\}$ . Suppose that all data samples $\mathbf{d}_{i,j},\forall i,j$ , are drawn from an underlying distribution $\mathcal{P}$ , which is unknown to the servers. Here, the learning goal is that the classifier trained with limited data samples can match the ideal model trained with known $\mathcal{P}$ as much as possible.

III-B Classification Problem and ADMM Algorithm

We first introduce the classification problem solved by the two-phase DML framework. Let $\mathbf{w}:\mathcal{X}\rightarrow\mathcal{Y}$ be the trained classification model. The trained classifier $\mathbf{w}$ should guarantee that the accuracy of mapping any feature vector $\mathbf{x}_{i,j}$ (sampled from the distribution $\mathcal{P}$ ) to its correct label $y_{i,j}$ is high. We employ the method of regularized empirical risk minimization, which is a commonly used approach to find an appropriate classifier [34]. Denote the classifier trained by server $s_{i}$ as $\mathbf{w}_{i}\in\mathbb{R}^{d}$ . The objective function (or the empirical risk) of the minimization problem is defined as

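$$J(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}}):=\sum_{i=1}^{n}\left[\sum_{j=1}^{m_{i}}\frac{1}{m_{i}}\ell(y_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j})+\frac{a}{n}N(\mathbf{w}_{i})\right],\tag{1}$$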

where $\ell:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ is the loss function measuring the performance of the trained classifier $\mathbf{w}_{i}$ . The regularizer $N(\mathbf{w}_{i})$ is introduced to mitigate overfitting, and $a>0$ is a constant. We take a bounded classifier class $\mathcal{W}\subset\mathbb{R}^{d}$ such that $\mathbf{w}_{i}\in\mathcal{W},\forall i$ . For the loss function $\ell(\cdot)$ and the regularizer $N(\cdot)$ , we introduce the following assumptions [14] [15].

Assumption 1.

The loss function $\ell(\cdot)$ is convex and twice differentiable in $\mathbf{w}$ . In particular, $\ell(\cdot)$ , $\frac{\partial\ell(\cdot)}{\partial\mathbf{w}}$ and $\frac{\partial\ell^{2}(\cdot)}{\partial\mathbf{w}^{2}}$ are bounded over the class $\mathcal{W}$ as

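$$|\ell(\cdot)|\leq c_{1},\quad\left\|\frac{\partial\ell(\cdot)}{\partial\mathbf{w}}\right\|\leq c_{2},\quad\left\|\frac{\partial\ell^{2}(\cdot)}{\partial\mathbf{w}^{2}}\right\|\leq c_{3},$$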

where $c_{1}$ , $c_{2}$ and $c_{3}$ are positive constants. Moreover, it holds $\frac{\partial\ell^{2}(y,\mathbf{w}^{\mathrm{T}}\mathbf{x})}{\partial{\mathbf{w}}^{2}}=\frac{\partial\ell^{2}(-y,\mathbf{w}^{\mathrm{T}}\mathbf{x})}{\partial{\mathbf{w}}^{2}}$ .

Assumption 2.

The regularizer $N(\cdot)$ is twice differentiable and strongly convex with parameter $\kappa>0$ , i.e., $\forall\mathbf{w}_{1},\mathbf{w}_{2}\in\mathcal{W}$ ,

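$$N(\mathbf{w}_{1})-N(\mathbf{w}_{2})\geq\nabla N(\mathbf{w}_{1})^{\mathrm{T}}(\mathbf{w}_{2}-\mathbf{w}_{1})+\frac{\kappa}{2}\|\mathbf{w}_{2}-\mathbf{w}_{1}\|_{2}^{2},$$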

where $\nabla N(\cdot)$ indicates the gradient with respect to $\mathbf{w}$ .

We note that $J(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ in (1) can be separated into $n$ different parts, where each part is the objective function of the local minimization problem to be solved by each server. The objective function of server $s_{i}$ is

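$$J_{i}(\mathbf{w}_{i}):=\sum_{j=1}^{m_{i}}\frac{1}{m_{i}}\ell(y_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j})+\frac{a}{n}N(\mathbf{w}_{i}).$$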
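As a concrete illustration of the local objective of server $s_{i}$ (the average loss over its $m_{i}$ users plus the regularizer weighted by $a/n$), the following minimal sketch evaluates it for one server. The logistic loss and the regularizer $N(\mathbf{w})=\frac{1}{2}\|\mathbf{w}\|_{2}^{2}$ are assumptions chosen for the example because they are convex, smooth, and strongly convex respectively; the paper's assumptions admit other choices, and the function names are hypothetical.

```python
import numpy as np

def logistic_loss(y, score):
    # Assumed example loss: log(1 + exp(-y * score)), computed stably.
    return np.logaddexp(0.0, -y * score)

def local_objective(w, X, y, a, n):
    """J_i(w) = (1/m_i) * sum_j loss(y_j, w^T x_j) + (a/n) * N(w).

    X has one row per user of group i, y holds labels in {-1, +1},
    and N(w) = 0.5 * ||w||^2 is the assumed regularizer.
    """
    m_i = X.shape[0]
    data_term = logistic_loss(y, X @ w).sum() / m_i
    reg_term = (a / n) * 0.5 * np.dot(w, w)
    return data_term + reg_term

# Two users with two features each, three servers in total.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
value_at_zero = local_objective(np.zeros(2), X, y, a=1.0, n=3)
```

At $\mathbf{w}=0$ every score is zero, so the logistic loss equals $\log 2$ for each sample and the regularizer vanishes.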

Since $\mathbf{w}_{i}$ is trained based on the data of the $i$ th group of users, it may only partially reflect data characteristics. To find a common classifier taking account of all participating users, we place a global consensus constraint in the minimization problem, as $\mathbf{w}_{i}=\mathbf{w}_{l},\forall s_{i},s_{l}\in\mathcal{S}$ . However, since we use a connected graph to describe the interaction between servers, we have to utilize a local consensus constraint:

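$$\mathbf{w}_{i}=\mathbf{z}_{il},\quad\mathbf{w}_{l}=\mathbf{z}_{il},\quad\forall(s_{i},s_{l})\in\mathcal{E},\tag{4}$$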

where $\mathbf{z}_{il}\in\mathbb{R}^{d}$ is an auxiliary variable enforcing consensus between neighbor servers $s_{i}$ and $s_{l}$ . Obviously, (4) also implies global consensus. We can now write the whole regularized empirical risk minimization problem as follows [35].

Problem 1.

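$$\min_{\{\mathbf{w}_{i}\},\{\mathbf{z}_{il}\}}\quad J(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$$

$$\text{s.t.}\quad\mathbf{w}_{i}=\mathbf{z}_{il},\quad\mathbf{w}_{l}=\mathbf{z}_{il},\quad\forall(s_{i},s_{l})\in\mathcal{E}.$$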

Next, we establish a compact form of Problem 1. Let $\mathbf{w}:=[\mathbf{w}_{1}^{\mathrm{T}}\cdots\mathbf{w}_{n}^{\mathrm{T}}]^{\mathrm{T}}\in\mathbb{R}^{nd}$ and $\mathbf{z}\in\mathbb{R}^{2Ed}$ be vectors aggregating all classifiers $\mathbf{w}_{i}$ and auxiliary variables $\mathbf{z}_{il}$ , respectively. To transform all local consensus constraints into a matrix form, we introduce two block matrices $A_{1},A_{2}\in\mathbb{R}^{{2Ed}\times{nd}}$ , which are partitioned into $2E\times n$ submatrices with dimension $d\times d$ . For the communication link $(s_{i},s_{l})\in\mathcal{E}$ , if $\mathbf{z}_{il}$ is the $m$ th block of $\mathbf{z}$ , then the $(m,i)$ th submatrix of $A_{1}$ and $(m,l)$ th submatrix of $A_{2}$ are the $d\times d$ identity matrix $I_{d}$ ; otherwise, these submatrices are the $d\times d$ zero matrix $0_{d}$ . We write $J(\mathbf{w})=\sum_{i=1}^{n}J_{i}(\mathbf{w}_{i})$ , $A:=[A_{1}^{\mathrm{T}}A_{2}^{\mathrm{T}}]^{\mathrm{T}}$ , and $B:=[-I_{2Ed}\;-\!I_{2Ed}]^{\mathrm{T}}$ . Then, Problem 1 can be written in a compact form as

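$$\min_{\mathbf{w},\mathbf{z}}\quad J(\mathbf{w})\tag{7}$$

$$\text{s.t.}\quad A\mathbf{w}+B\mathbf{z}=0.\tag{8}$$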

For solving this problem we introduce the fully distributed ADMM algorithm from [26]. The augmented Lagrange function associated with (7) and (8) is given by $\mathcal{L}(\mathbf{w},\mathbf{z},\boldsymbol{\lambda}):=J(\mathbf{w})+\boldsymbol{\lambda}^{\mathrm{T}}(A\mathbf{w}+B\mathbf{z})+\frac{\beta}{2}\|A\mathbf{w}+B\mathbf{z}\|_{2}^{2}$ , where $\boldsymbol{\lambda}\in\mathbb{R}^{4Ed}$ is the dual variable ( $\mathbf{w}$ is correspondingly called the primal variable) and $\beta\in\mathbb{R}$ is the penalty parameter.

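At iteration $t+1$ , the optimal auxiliary variable $\mathbf{z}(t+1)$ satisfies the stationarity condition $\nabla_{\mathbf{z}}\mathcal{L}(\mathbf{w}(t+1),\mathbf{z}(t+1),\boldsymbol{\lambda}(t))=0$ , that is, $B^{\mathrm{T}}\boldsymbol{\lambda}(t)+\beta B^{\mathrm{T}}(A\mathbf{w}(t+1)+B\mathbf{z}(t+1))=0$ . Combining this with the standard ADMM dual update $\boldsymbol{\lambda}(t+1)=\boldsymbol{\lambda}(t)+\beta(A\mathbf{w}(t+1)+B\mathbf{z}(t+1))$ , we have $B^{\mathrm{T}}\boldsymbol{\lambda}(t+1)=0$ . Let $\boldsymbol{\lambda}=[\boldsymbol{\xi}^{\mathrm{T}}\boldsymbol{\zeta}^{\mathrm{T}}]^{\mathrm{T}}$ with $\boldsymbol{\xi},\boldsymbol{\zeta}\in\mathbb{R}^{2Ed}$ . Since $B^{\mathrm{T}}=[-I_{2Ed}\;-\!I_{2Ed}]$ , this relation reads $\boldsymbol{\xi}(t+1)=-\boldsymbol{\zeta}(t+1)$ . If we also set the initial value of $\boldsymbol{\lambda}$ such that $\boldsymbol{\xi}(0)=-\boldsymbol{\zeta}(0)$ , we have $\boldsymbol{\xi}(t)=-\boldsymbol{\zeta}(t),\forall t\geq 0$ . Thus, we can obtain the complete dual variable $\boldsymbol{\lambda}$ by solving for $\boldsymbol{\xi}$ . Let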

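$$L_{+}:=\frac{1}{2}(A_{1}+A_{2})^{\mathrm{T}}(A_{1}+A_{2}),\quad L_{-}:=\frac{1}{2}(A_{1}-A_{2})^{\mathrm{T}}(A_{1}-A_{2}).$$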
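The block matrices $A_{1}$ , $A_{2}$ and the products $L_{+}=\frac{1}{2}(A_{1}+A_{2})^{\mathrm{T}}(A_{1}+A_{2})$ and $L_{-}=\frac{1}{2}(A_{1}-A_{2})^{\mathrm{T}}(A_{1}-A_{2})$ can be built directly from the link set. The sketch below is a minimal illustration, assuming that each undirected link contributes two block rows (one per direction), which is what gives $\mathbf{z}$ its $2E$ blocks; the function name and the zero-based server indices are hypothetical.

```python
import numpy as np

def build_consensus_matrices(n, links, d):
    """Build A1, A2, L_plus, L_minus for n servers and undirected links.

    Each link (i, l) yields two arcs, (i, l) and (l, i), so there are
    2E block rows. Row block m has I_d in column block i of A1 and in
    column block l of A2.
    """
    arcs = list(links) + [(l, i) for (i, l) in links]
    A1 = np.zeros((len(arcs) * d, n * d))
    A2 = np.zeros_like(A1)
    eye = np.eye(d)
    for m, (i, l) in enumerate(arcs):
        A1[m * d:(m + 1) * d, i * d:(i + 1) * d] = eye
        A2[m * d:(m + 1) * d, l * d:(l + 1) * d] = eye
    L_plus = 0.5 * (A1 + A2).T @ (A1 + A2)
    L_minus = 0.5 * (A1 - A2).T @ (A1 - A2)
    return A1, A2, L_plus, L_minus

# Path graph s_0 - s_1 - s_2 with scalar classifiers (d = 1).
links = [(0, 1), (1, 2)]
A1, A2, L_plus, L_minus = build_consensus_matrices(3, links, 1)
```

For this path graph, `L_minus` is the usual graph Laplacian and `L_plus` is the signless Laplacian, so their sum is twice the degree matrix. This is why the compact iterations decouple into per-server updates that only involve neighbors.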

Define a new dual variable $\boldsymbol{\gamma}:=(A_{1}-A_{2})^{\mathrm{T}}\boldsymbol{\xi}\in\mathbb{R}^{nd}$ . Through the simplification process in [26], we obtain the fully distributed ADMM for solving Problem 1, which is composed of the following iterations:

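$$\nabla J(\mathbf{w}(t+1))+\boldsymbol{\gamma}(t)+\beta(L_{+}+L_{-})\mathbf{w}(t+1)-\beta L_{+}\mathbf{w}(t)=0,$$

$$\boldsymbol{\gamma}(t+1)-\boldsymbol{\gamma}(t)-\beta L_{-}\mathbf{w}(t+1)=0.$$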

Note that $\boldsymbol{\gamma}$ is also a compact vector of all local dual variables $\boldsymbol{\gamma}_{i}\in\mathbb{R}^{d}$ for $s_{i}\in\mathcal{S}$ , i.e., $\boldsymbol{\gamma}=[\boldsymbol{\gamma}_{1}^{\mathrm{T}}\cdots\boldsymbol{\gamma}_{n}^{\mathrm{T}}]^{\mathrm{T}}$ .

The above ADMM iterations can be separated into $n$ different parts, which are solved by the $n$ different servers. At iteration $t+1$ , the information used by server $s_{i}$ to update a new primal variable $\mathbf{w}_{i}(t+1)$ includes users’ data $\mathbf{d}_{i,j},\forall j$ , current classifiers $\left\{\mathbf{w}_{l}(t)\;|\;l\in\mathcal{N}_{i}\bigcup\{i\}\right\}$ and dual variable $\boldsymbol{\gamma}_{i}(t)$ . The local augmented Lagrange function $\mathcal{L}_{i}(\mathbf{w}_{i},\{\mathbf{w}_{l}(t)\}_{l\in\mathcal{N}_{i}\bigcup\{i\}},\boldsymbol{\gamma}_{i}(t))$ associated with the primal variable update is given by

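$$\mathcal{L}_{i}(\mathbf{w}_{i},\{\mathbf{w}_{l}(t)\}_{l\in\mathcal{N}_{i}\bigcup\{i\}},\boldsymbol{\gamma}_{i}(t)):=J_{i}(\mathbf{w}_{i})+\boldsymbol{\gamma}_{i}^{\mathrm{T}}(t)\mathbf{w}_{i}+\beta\sum_{l\in\mathcal{N}_{i}}\left\|\mathbf{w}_{i}-\frac{1}{2}(\mathbf{w}_{i}(t)+\mathbf{w}_{l}(t))\right\|_{2}^{2}.$$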

At each iteration, server $s_{i}$ will update its primal variable $\mathbf{w}_{i}(t+1)$ and dual variable $\boldsymbol{\gamma}_{i}(t+1)$ as follows:

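$$\mathbf{w}_{i}(t+1)=\arg\min_{\mathbf{w}_{i}}\mathcal{L}_{i}(\mathbf{w}_{i},\{\mathbf{w}_{l}(t)\}_{l\in\mathcal{N}_{i}\bigcup\{i\}},\boldsymbol{\gamma}_{i}(t)),\tag{9}$$

$$\boldsymbol{\gamma}_{i}(t+1)=\boldsymbol{\gamma}_{i}(t)+\beta\sum_{l\in\mathcal{N}_{i}}(\mathbf{w}_{i}(t+1)-\mathbf{w}_{l}(t+1)).\tag{10}$$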

Clearly, in (9) and (10), the information communicated between computing servers is the newly updated classifiers.
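The per-server iteration (minimize the local augmented Lagrange function, exchange the new classifiers with neighbors, then update the dual variable) can be sketched as follows. This is an illustrative, non-private version only: the logistic loss, the regularizer $N(\mathbf{w})=\frac{1}{2}\|\mathbf{w}\|_{2}^{2}$ , the inexact inner solve by plain gradient descent, and all names and default parameters are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def grad_local(w, X, y, a, n):
    # Gradient of J_i for the assumed logistic loss and N(w) = 0.5*||w||^2.
    dloss = -y / (1.0 + np.exp(y * (X @ w)))  # d loss / d score
    return X.T @ dloss / X.shape[0] + (a / n) * w

def admm(data, neighbors, a=1.0, beta=1.0, iters=200, inner=50, lr=0.1):
    n = len(data)
    d = data[0][0].shape[1]
    w = np.zeros((n, d))
    gamma = np.zeros((n, d))
    for _ in range(iters):
        w_new = np.empty_like(w)
        for i, (X, y) in enumerate(data):
            # Midpoints 0.5 * (w_i(t) + w_l(t)) appearing in the penalty term.
            mids = [0.5 * (w[i] + w[l]) for l in neighbors[i]]
            v = w[i].copy()
            for _ in range(inner):  # approximate arg min of L_i
                g = grad_local(v, X, y, a, n) + gamma[i]
                g += 2.0 * beta * sum(v - mid for mid in mids)
                v -= lr * g
            w_new[i] = v
        for i in range(n):
            # Dual update uses only the neighbors' newly updated classifiers.
            gamma[i] += beta * sum(w_new[i] - w_new[l] for l in neighbors[i])
        w = w_new
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0])
data = []
for _ in range(3):
    X = rng.normal(size=(20, 2))
    data.append((X, np.where(X @ w_true >= 0, 1.0, -1.0)))
neighbors = {0: [1], 1: [0, 2], 2: [1]}
w_final = admm(data, neighbors)
```

In this sketch the only quantities that cross server boundaries are the classifiers `w_new[l]`, which is exactly the information a privacy mechanism for Phase 2 would need to protect.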

III-C Privacy-preserving Problem

In this subsection, we introduce the privacy-preserving problem in the DML framework. The private information to be preserved is first defined, followed by the introduction of privacy violators and information used for privacy inference. Further, we present the objectives of the two phases.

Private information. For users, both the feature vectors and the labels of the data samples contain their sensitive information. The private information contained in the feature vectors may be the ID, gender, general health data and so on. However, the labels may indicate, for example, whether a patient contracts a disease (e.g., HIV-1 infected) or whether a user has a special identity (e.g., a member of a certain group). We can see that compared with the feature vectors, the labels may be more sensitive for the users. In this paper, we consider that the labels of users’ data are the most sensitive information, which should be protected with priority and obtain stronger privacy guarantee than that of feature vectors.

Privacy attacks. All computing servers are viewed as untrustworthy potential privacy violators desiring to infer the sensitive information contained in users’ data. In the meantime, different servers present distinct trust degrees to users. User $j\in\mathcal{U}_{i}$ divides the potential privacy violators into two types. The server $s_{i}$ , collecting user $j$ ’s data directly, is the first type. Other servers $s_{l}\in\mathcal{S},s_{l}\neq s_{i}$ , having no direct connection with user $j$ , are the second type. Compared with server $s_{i}$ , other servers are less trustworthy for user $j$ . To conduct privacy inference, the first type of privacy violators leverages user $j$ ’s reported data while the second type can utilize only the intermediate information shared by servers.

Privacy protections in Phases 1 & 2. Since the label of user $j\in\mathcal{U}_{i}$ is the most sensitive information, its original value should not be disclosed to any servers including server $s_{i}$ . Thus, during the data reporting process in Phase 1, user $j$ must obfuscate the private label in his/her local device. For the less sensitive feature vector, considering that server $s_{i}$ is more trustworthy, user $j$ can choose to transmit the original version to that server. Nevertheless, the user is still unwilling to disclose the raw feature vector to servers with lower trust degrees. Hence, in this paper, when server $s_{i}$ interacts with other servers to find a common classifier in Phase 2, the released information about user $j$ ’s data will be further processed before communication.

More specifically, in Phase 1, to obfuscate the labels, we use a local randomization approach, whose privacy-preserving property is measured by local differential privacy (LDP) [30]. LDP is developed from differential privacy (DP), which was originally defined for trustworthy databases that publish aggregated private information [6]. The idea of DP is that, for any two neighboring databases differing in one record (e.g., one user chooses to report or not to report his/her data to the server), a randomized mechanism guarantees that the two outputs are highly similar, so that privacy violators cannot identify the differing record with high confidence. Since there is no trusted server for data collection in our setting, users locally perturb their original labels and report noisy versions to the servers.

To this end, we define a randomized mechanism $M:\mathbb{R}^{d+1}\rightarrow\mathbb{R}^{d+1}$ , which takes a data sample as input and outputs its noisy version. The definition of LDP is given as follows.

Definition 1.

( $\epsilon$ -LDP). Given $\epsilon>0$ , a randomized mechanism $M(\cdot)$ preserves $\epsilon$ -LDP if for any two data samples $\mathbf{d}_{1}=(\mathbf{x}_{1},y_{1})$ and $\mathbf{d}_{2}=(\mathbf{x}_{2},y_{2})$ satisfying $\mathbf{x}_{2}=\mathbf{x}_{1}$ and $y_{2}=-y_{1}$ , and any observation set $\mathcal{O}\subseteq\textrm{Range}(M)$ , it holds

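$$\Pr\left[M(\mathbf{d}_{1})\in\mathcal{O}\right]\leq e^{\epsilon}\Pr\left[M(\mathbf{d}_{2})\in\mathcal{O}\right].\tag{11}$$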

In (11), the parameter $\epsilon$ is called the privacy preserving degree (PPD), which describes the strength of privacy guarantee of $M(\cdot)$ . A smaller $\epsilon$ implies stronger privacy guarantee. This is because smaller $\epsilon$ means that the two outputs $M(\mathbf{d}_{1})$ and $M(\mathbf{d}_{2})$ are closer, making it more difficult for privacy violators to infer the difference in $\mathbf{d}_{1}$ and $\mathbf{d}_{2}$ (i.e., $y_{1}$ and $y_{2}$ ).

III-D System Overview

In this paper, we propose the PDML framework, where users can obtain heterogeneous privacy protection. The heterogeneity is characterized by two aspects: i) When a user faces a privacy violator, his/her data pieces with distinct sensitive levels (i.e., the feature vector and the label) obtain different privacy guarantees; ii) for one type of private data piece, the privacy protection provided by the framework is stronger against privacy violators with low trust degrees than those with higher trust degrees. Particularly, in our approach, the privacy preservation strength of users’ labels is controlled by the users. Moreover, a modified ADMM algorithm is proposed to meet the heterogeneous privacy protection requirement.

The workflow of the proposed PDML framework is illustrated in Fig. 2. Some details are explained below.

1. In Phase 1, a user first appropriately randomizes the private label, and then sends the noisy label and the original feature vector to a computing server. The randomization approach used here determines the PPD of the label.

2. In Phase 2, multiple computing servers collaboratively train a common classifier based on their collected data. To protect privacy of feature vectors against less trustworthy servers, we further use a combined noise-adding method to perturb the ADMM algorithm, which also strengthens the privacy guarantee of users’ labels.

3. The performance of the trained classifiers is analyzed in terms of their generalization errors. To decompose the effects of uncertainties introduced in the two phases, we modify the loss function in Problem 1. We finally quantify the difference between the generalization error of trained classifiers and that of the ideal optimal classifier.

IV Privacy-Preserving Framework Design

In this section, we introduce the privacy-preserving approaches used in Phases 1 and 2, and analyze their properties.

IV-A Privacy-Preserving Approach in Phase 1

In this subsection, we propose an appropriate approach used in Phase 1 to provide privacy preservation for the most sensitive labels. In particular, it is controlled by users and will not be weakened in Phase 2.

We adopt the idea of randomized response (RR) [30] to obfuscate the users’ labels. Originally, RR was used to set plausible deniability for respondents when they answer survey questions about sensitive topics (e.g., HIV-1 infected or uninfected). When using RR, respondents only have a certain probability to answer questions according to their true situations, making the server unable to determine with certainty whether the reported answers are true.

In our setting, user $j\in\mathcal{U}_{i}$ randomizes the label through RR and sends the noisy version to server $s_{i}$ . This is done by the randomized mechanism $M$ defined below.

Definition 2.

For $p\in(0,\frac{1}{2})$ , the randomized mechanism $M$ with input data sample $\mathbf{d}_{i,j}=(\mathbf{x}_{i,j},y_{i,j})$ is given by $M(\mathbf{d}_{i,j})=(\mathbf{x}_{i,j},y^{\prime}_{i,j})$ , where

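$$y^{\prime}_{i,j}=\begin{cases}y_{i,j},&\text{with probability }1-p,\\-y_{i,j},&\text{with probability }p.\end{cases}\tag{12}$$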

In the above definition, $p$ is the randomization probability controlling the level of data obfuscation. Obviously, a larger $p$ implies higher uncertainty on the reported label, making it harder for the server to learn the true label.

Denote the output $M(\mathbf{d}_{i,j})$ as $\mathbf{d}^{\prime}_{i,j}$ , i.e., $\mathbf{d}^{\prime}_{i,j}=M(\mathbf{d}_{i,j})=(\mathbf{x}_{i,j},y^{\prime}_{i,j})$ . After the randomization, $\mathbf{d}^{\prime}_{i,j}$ will be transmitted to the server. In this case, server $s_{i}$ can use only $\mathbf{d}^{\prime}_{i,j}$ to train the classifier, and the released information about the true label $y_{i,j}$ in Phase 2 is computed based on $\mathbf{d}^{\prime}_{i,j}$ . This implies that once $\mathbf{d}^{\prime}_{i,j}$ is reported to the server, no more information about the true label $y_{i,j}$ will be released. In this paper, we set the randomization probability $p$ in (12) as

$$p=\frac{1}{1+e^{\epsilon}},\tag{13}$$

where $\epsilon>0$. The following proposition gives the privacy-preserving property of the randomized mechanism in Definition 2, justifying this choice of $p$ from the viewpoint of LDP.

Proposition 1.

Under (13), the randomized mechanism $M(\mathbf{d}_{i,j})$ preserves $\epsilon$ -LDP for $\mathbf{d}_{i,j}$ .

The proof can be found in Appendix A.

Proposition 1 indicates that the users can tune the randomization probability according to their privacy demands. Indeed, given a randomization probability $p$, (13) implies that the PPD provided by $M(\mathbf{d}_{i,j})$ is $\epsilon=\ln\frac{1-p}{p}$. A larger randomization probability thus leads to a smaller PPD, indicating a stronger privacy guarantee.
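The relation between $p$ and $\epsilon$ can be checked numerically. The following is a minimal sketch, assuming binary labels in $\{-1,+1\}$ and a flip with probability $p=1/(1+e^{\epsilon})$ (equivalent to $\epsilon=\ln\frac{1-p}{p}$); the function names are hypothetical and chosen only for illustration.

```python
import math
import random


def flip_probability(epsilon: float) -> float:
    """Randomization probability p = 1 / (1 + e^epsilon), i.e. epsilon = ln((1 - p) / p)."""
    return 1.0 / (1.0 + math.exp(epsilon))


def randomize_label(y: int, epsilon: float, rng: random.Random) -> int:
    """Report the true label y in {-1, +1} with probability 1 - p, and -y with probability p."""
    p = flip_probability(epsilon)
    return -y if rng.random() < p else y


def ldp_ratio(epsilon: float) -> float:
    """Worst-case ratio Pr[y' = o | y] / Pr[y' = o | -y] over the two possible outputs o."""
    p = flip_probability(epsilon)
    return (1.0 - p) / p


rng = random.Random(0)
epsilon = 1.0
reports = [randomize_label(+1, epsilon, rng) for _ in range(10_000)]
flipped_fraction = reports.count(-1) / len(reports)  # empirical estimate of p
```

Here `ldp_ratio(epsilon)` equals $e^{\epsilon}$, which is the bound required by $\epsilon$-LDP, and `flipped_fraction` approximates $p$.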

If all data samples $\mathbf{d}_{i,j},\forall i,j$ , drawn from the distribution $\mathcal{P}$ are randomized through $M$ , the noisy data $\mathbf{d}^{\prime}_{i,j},\forall i,j$ , can be considered to be obtained from a new distribution $\mathcal{P}_{\epsilon}$ , which is related to the PPD $\epsilon$ . Note that $\mathcal{P}_{\epsilon}$ is also an unknown distribution due to the unknown $\mathcal{P}$ .

IV-B Privacy-Preserving Approach in Phase 2

To deal with less trustworthy servers in Phase 2, we devise a combined noise-adding approach to simultaneously preserve privacy for users’ feature vectors and enhance the privacy guarantee of users’ labels. We first adopt the method of objective function perturbation [14]. That is, before solving Problem 1, the servers perturb the objective function $J(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ with random noises. For server $s_{i}\in\mathcal{S}$ , the perturbed objective function is given by

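$$\widetilde{J}_{i}(\mathbf{w}_{i}):=J_{i}(\mathbf{w}_{i})+\frac{1}{n}\boldsymbol{\eta}_{i}^{\mathrm{T}}\mathbf{w}_{i},\tag{14}$$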

where $J_{i}(\mathbf{w}_{i})$ is the local objective function given in (3), and $\boldsymbol{\eta}_{i}\in\mathbb{R}^{d}$ is a bounded random noise with arbitrary distribution. Let $R$ be the bound of noises $\boldsymbol{\eta}_{i},\forall i$ , namely, $\|\boldsymbol{\eta}_{i}\|_{\infty}\leq R$ . Denote the sum of $\widetilde{J}_{i}(\mathbf{w}_{i})$ as $\widetilde{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}}):=\sum_{i=1}^{n}\widetilde{J}_{i}(\mathbf{w}_{i})$ .

Limitation of objective function perturbation. We remark that in our setting, the objective function perturbation in (14) is not sufficient to provide a reliable privacy guarantee. This is because each server publishes its current classifier multiple times, and each publication uses users’ reported data. Note that in the more centralized setting of [14], the classifier is published only once. More specifically, according to (9), $\mathbf{w}_{i}(t+1)$ is the solution to $\nabla\mathcal{L}_{i}(\mathbf{w}_{i},\{\mathbf{w}_{l}(t)\}_{l\in\mathcal{N}_{i}\bigcup\{i\}},\boldsymbol{\gamma}_{i}(t))=0$. In this case, it holds that $\nabla\widetilde{J}_{i}(\mathbf{w}_{i}(t+1))=-\boldsymbol{\gamma}_{i}(t)+\beta\sum_{l\in\mathcal{N}_{i}}\left(\mathbf{w}_{i}(t)+\mathbf{w}_{l}(t)-2\mathbf{w}_{i}(t+1)\right)$. As (10) shows, the dual variable $\boldsymbol{\gamma}_{i}(t)$ can be deduced from the updated classifiers. Thus, if $s_{i}$’s neighbor servers have access to $\mathbf{w}_{i}(t+1)$ and $\left\{\mathbf{w}_{l}(t)\;|\;l\in\{i\}\bigcup\mathcal{N}_{i}\right\}$, then they can easily compute $\nabla\widetilde{J}_{i}(\mathbf{w}_{i}(t+1))$.

We should highlight that multiple releases of $\nabla\widetilde{J}_{i}(\mathbf{w}_{i}(t+1))$ increase the risk of users’ privacy disclosure. This can be explained as follows. First, note that $\nabla\widetilde{J}_{i}(\mathbf{w}_{i})=\nabla J_{i}(\mathbf{w}_{i})+\frac{1}{n}\boldsymbol{\eta}_{i}$, where $\nabla J_{i}(\mathbf{w}_{i})$ contains users’ private information. The goal of the $\boldsymbol{\eta}_{i}$-perturbation is to prevent other servers from deriving $\nabla J_{i}(\mathbf{w}_{i})$ directly. However, after publishing an updated classifier $\mathbf{w}_{i}(t+1)$, server $s_{i}$ releases a new gradient $\nabla\widetilde{J}_{i}(\cdot)$. Since the noise $\boldsymbol{\eta}_{i}$ is fixed for all iterations, each release of $\nabla\widetilde{J}_{i}(\cdot)$ discloses more information about $\nabla J_{i}(\cdot)$. In particular, we have $\nabla\widetilde{J}_{i}(\mathbf{w}_{i}(t+1))-\nabla\widetilde{J}_{i}(\mathbf{w}_{i}(t))=\nabla J_{i}(\mathbf{w}_{i}(t+1))-\nabla J_{i}(\mathbf{w}_{i}(t))$. That is, the effect of the added noise $\boldsymbol{\eta}_{i}$ can be cancelled by combining (here, differencing) the gradients of the objective function at different iterations.
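The cancellation described above can be illustrated with a small numerical sketch. It assumes an illustrative local objective (the average logistic loss over a few hypothetical samples, with no regularizer) and a fixed noise vector `eta`; none of these concrete values come from the paper.

```python
import math


def grad_local(w, samples):
    """Gradient of an illustrative local objective: average logistic loss over (x, y) samples."""
    g = [0.0] * len(w)
    for x, y in samples:
        margin = y * sum(wk * xk for wk, xk in zip(w, x))
        coeff = -y / (1.0 + math.exp(margin))
        for k, xk in enumerate(x):
            g[k] += coeff * xk / len(samples)
    return g


def grad_perturbed(w, samples, eta, n):
    """Gradient of the eta-perturbed objective: grad J_i(w) + eta / n."""
    return [gk + ek / n for gk, ek in zip(grad_local(w, samples), eta)]


samples = [((1.0, 2.0), 1), ((-1.5, 0.5), -1), ((0.3, -0.7), 1)]  # hypothetical data
eta = (0.8, -0.4)  # noise drawn once and kept fixed over all iterations
n = 3              # number of servers
w_t, w_next = (0.1, -0.2), (0.25, 0.05)

# Differencing two released (perturbed) gradients removes eta entirely.
diff_perturbed = [a - b for a, b in zip(grad_perturbed(w_next, samples, eta, n),
                                        grad_perturbed(w_t, samples, eta, n))]
diff_true = [a - b for a, b in zip(grad_local(w_next, samples),
                                   grad_local(w_t, samples))]
```

Up to floating-point rounding, `diff_perturbed` coincides with `diff_true`, so an observer of two releases learns the exact change in the unperturbed gradient.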

Modified ADMM by primal variable perturbation. To ensure appropriate privacy preservation in Phase 2, we adopt an extra perturbation method, which sets obstructions for other servers to obtain the gradient $\nabla J_{i}(\cdot)$ . Specifically, after deriving classifier $\mathbf{w}_{i}(t)$ , server $s_{i}$ first perturbs $\mathbf{w}_{i}(t)$ with a Gaussian noise $\boldsymbol{\theta}_{i}(t)$ whose variance is decaying as iterations proceed, and then sends a noisy version of $\mathbf{w}_{i}(t)$ to neighbor servers. This is denoted by $\widetilde{\mathbf{w}}_{i}(t):=\mathbf{w}_{i}(t)+\boldsymbol{\theta}_{i}(t)$ , where $\boldsymbol{\theta}_{i}(t)\sim\mathcal{N}(0,\rho^{t-1}V_{i}^{2}I_{d})$ with decaying rate $0<\rho<1$ .
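As a minimal sketch of this perturbation step, the snippet below adds zero-mean Gaussian noise with per-coordinate variance $\rho^{t-1}V_{i}^{2}$ to a classifier represented as a plain list. The helper names and the numerical values are hypothetical.

```python
import random


def perturb_classifier(w, t, V, rho, rng):
    """Return w + theta with theta ~ N(0, rho^(t-1) * V^2 * I_d), for iteration t >= 1."""
    std = V * rho ** ((t - 1) / 2.0)
    return [wk + rng.gauss(0.0, std) for wk in w]


def noise_variance(t, V, rho):
    """Per-coordinate variance of theta_i(t)."""
    return rho ** (t - 1) * V ** 2


rng = random.Random(1)
w = [0.5, -1.0, 2.0]
w_tilde = perturb_classifier(w, t=1, V=0.1, rho=0.9, rng=rng)
variances = [noise_variance(t, V=0.1, rho=0.9) for t in range(1, 6)]
```

The list `variances` decreases geometrically in $t$, so later releases are perturbed less than earlier ones.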

The local augmented Lagrange function associated with $\boldsymbol{\eta}_{i}$ -perturbed objective function $\widetilde{J}_{i}(\mathbf{w}_{i})$ in (14) is given by

[TABLE]

We then introduce the perturbed version of the ADMM algorithm in (9) and (10) as

[TABLE]

At iteration $t+1$ , a new classifier $\mathbf{w}_{i}(t+1)$ is first obtained by solving $\nabla\tilde{\mathcal{L}}_{i}(\mathbf{w}_{i},\widetilde{\mathbf{w}}_{i}(t),\boldsymbol{\gamma}_{i}(t))=0$ . Then, server $s_{i}$ will send $\widetilde{\mathbf{w}}_{i}(t+1)$ out and wait for the updated classifiers from neighbor servers. At the end of an iteration, the server will update the dual variable $\boldsymbol{\gamma}_{i}(t+1)$ .
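The iteration order described above (primal solve, noisy broadcast, dual update) can be sketched on a toy problem. The example assumes scalar classifiers and quadratic local objectives $J_{i}(w)=\frac{1}{2}a_{i}(w-b_{i})^{2}$, so that the primal step has a closed form. The dual update is assumed to take the standard decentralized ADMM form $\gamma_{i}\leftarrow\gamma_{i}+\beta\sum_{l\in\mathcal{N}_{i}}(\widetilde{w}_{i}-\widetilde{w}_{l})$. All names and values are hypothetical.

```python
import random


def perturbed_admm(a, b, neighbors, eta, beta, V, rho, iters, seed=0):
    """Perturbed decentralized ADMM for scalar quadratics J_i(w) = 0.5 * a[i] * (w - b[i])**2.

    Assumed updates (illustrative instance, scalar classifier):
      primal: solve  J_i'(w) + eta[i] / n + gamma_i
                     - beta * sum_l (w~_i + w~_l) + 2 * beta * N_i * w = 0
      share : w~_i = w_i + theta_i,  theta_i ~ N(0, rho^(t-1) * V^2)
      dual  : gamma_i += beta * sum_l (w~_i - w~_l)
    """
    rng = random.Random(seed)
    n = len(a)
    w = [0.0] * n
    w_tilde = [0.0] * n
    gamma = [0.0] * n
    for t in range(1, iters + 1):
        w = []
        for i in range(n):
            deg = len(neighbors[i])
            mix = sum(w_tilde[i] + w_tilde[l] for l in neighbors[i])
            # Closed-form root of the first-order condition for a quadratic J_i.
            w.append((a[i] * b[i] - eta[i] / n - gamma[i] + beta * mix)
                     / (a[i] + 2.0 * beta * deg))
        std = V * rho ** ((t - 1) / 2.0)
        # Only the noisy versions are exchanged between servers.
        w_tilde = [wi + rng.gauss(0.0, std) for wi in w]
        gamma = [gamma[i] + beta * sum(w_tilde[i] - w_tilde[l] for l in neighbors[i])
                 for i in range(n)]
    return w


a, b = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}  # fully connected 3-server graph
w_clean = perturbed_admm(a, b, neighbors, eta=[0.0] * 3, beta=1.0,
                         V=0.0, rho=0.5, iters=500)
w_noisy = perturbed_admm(a, b, neighbors, eta=[0.3, -0.2, 0.1], beta=1.0,
                         V=0.5, rho=0.5, iters=500)
```

Without noise, the servers agree on the minimizer of $\sum_{i}J_{i}$. With a fixed `eta` and decaying primal noise, they agree on the minimizer of the `eta`-perturbed sum instead, which is the shift that the later performance analysis has to account for.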

IV-C Discussions

We now discuss the effectiveness of the primal variable perturbation. It is emphasized that at each iteration, $s_{i}$ only releases a small amount of information about $\nabla\widetilde{J}_{i}(\mathbf{w}_{i}(t+1))$ through the communicated $\widetilde{\mathbf{w}}_{i}(t+1)$ . Although $\boldsymbol{\gamma}_{i}(t)$ and $\left\{\widetilde{\mathbf{w}}_{l}(t)\;|\;l\in\{i\}\bigcup\mathcal{N}_{i}\right\}$ are known to $s_{i}$ ’s neighbors, $\nabla\widetilde{J}_{i}(\mathbf{w}_{i}(t+1))$ cannot be directly computed due to the unknown $\boldsymbol{\theta}_{i}(t+1)$ . More specifically, observe that by (15), we have $\nabla\widetilde{J}_{i}(\mathbf{w}_{i}(t+1))=-\boldsymbol{\gamma}_{i}(t)+\beta\sum_{l\in\mathcal{N}_{i}}\left(\widetilde{\mathbf{w}}_{i}(t)+\widetilde{\mathbf{w}}_{l}(t)\right)-2\beta N_{i}(\widetilde{\mathbf{w}}_{i}(t+1)-\boldsymbol{\theta}_{i}(t+1))$ , where $N_{i}$ is the degree of $s_{i}$ .

On the other hand, using available information, other servers can compute only $\nabla\widetilde{J}_{i}(\widetilde{\mathbf{w}}_{i}(t+1))$ , i.e., the gradient with respect to perturbed classifier $\widetilde{\mathbf{w}}_{i}(t+1)$ . We have $\nabla\widetilde{J}_{i}(\widetilde{\mathbf{w}}_{i}(t+1))=-\boldsymbol{\gamma}_{i}(t)-\beta\sum_{l\in\mathcal{N}_{i}}\left[2\widetilde{\mathbf{w}}_{i}(t+1)-(\widetilde{\mathbf{w}}_{i}(t)+\widetilde{\mathbf{w}}_{l}(t))\right]$ . Thus, we obtain $\nabla\widetilde{J}_{i}(\widetilde{\mathbf{w}}_{i}(t+1))-\nabla\widetilde{J}_{i}(\widetilde{\mathbf{w}}_{i}(t))=\nabla J_{i}(\mathbf{w}_{i}(t+1))-\nabla J_{i}(\mathbf{w}_{i}(t))-2\beta N_{i}(\boldsymbol{\theta}_{i}(t+1)-\boldsymbol{\theta}_{i}(t))$ . Hence, due to $\boldsymbol{\theta}_{i}$ , it would not be helpful for inferring $\nabla J_{i}(\cdot)$ to integrate the gradients of the objective functions at different iterations.

We should also observe that since $\lim_{t\rightarrow\infty}\boldsymbol{\theta}_{i}(t+1)=0$ , $\nabla\widetilde{J}_{i}(\mathbf{w}_{i}(t+1))$ can be derived when $t\rightarrow\infty$ . Moreover, it is clear that the relation $\nabla\widetilde{J}_{i}(\widetilde{\mathbf{w}}_{i}(t+1))-\nabla\widetilde{J}_{i}(\widetilde{\mathbf{w}}_{i}(t))=\nabla J_{i}(\mathbf{w}_{i}(t+1))-\nabla J_{i}(\mathbf{w}_{i}(t))$ holds for $t\rightarrow\infty$ . However, $\nabla\widetilde{J}_{i}(\cdot)$ is the result of $\nabla J_{i}(\cdot)$ under $\boldsymbol{\eta}_{i}$ -perturbation. Moreover, due to the local consensus constraint (4), the trained classifiers $\mathbf{w}_{i}(t)$ may not have significant differences when $t\rightarrow\infty$ . Such limited information is not sufficient for privacy violators to infer $\nabla J_{i}(\cdot)$ with high confidence.

Differential privacy analysis. We remark that in our scheme, the noise $\boldsymbol{\eta}_{i}$ added to the objective function provides underlying privacy protection in Phase 2. Even if privacy violators make inference with published $\widetilde{\mathbf{w}}_{i}$ in all iterations, the disclosed information is users’ reported data plus extra noise perturbation. If the objective function perturbation is removed, the primal variable perturbation method cannot provide DP guarantee when $t\rightarrow\infty$ . It is proved in [15] and [16] that the $\mathbf{w}_{i}$ -perturbation in (16) preserves dynamic DP. According to the composition theorem of DP [6], the PPD will increase (indicating weaker privacy guarantee) when other servers obtain the perturbed classifiers $\widetilde{\mathbf{w}}_{i}$ of multiple iterations. In particular, if the perturbed classifiers in all iterations are used for inference, the PPD will be $\infty$ , implying no privacy guarantee any more.

Remark 1.

The objective function perturbation given in (14) preserves the so-called $(\epsilon_{p},\delta_{p})$ -DP [36]. Also, according to [14], the perturbation in (14) preserves $\epsilon_{2}$ -DP if $\boldsymbol{\eta}_{i}$ has density $f(\boldsymbol{\eta}_{i})=\frac{1}{\nu}e^{-\epsilon_{2}\|\boldsymbol{\eta}_{i}\|}$ with normalizing parameter $\nu$ . Note that the noise with this density is not bounded, which is not consistent with our setting. Although we use a bounded noise, this kind of perturbation still provides $(\epsilon_{p},\delta_{p})$ -DP guarantee, which is a relaxed form of pure $\epsilon_{p}$ -DP.

Strengthened privacy guarantee. For users’ labels, the privacy guarantee in Phase 2 is stronger than that of Phase 1. Since differential privacy is immune to post-processing [6], the PPD $\epsilon$ in Phase 1 will not increase during the iterations of the ADMM algorithm executed in Phase 2. However, such immunity is established based on a strong assumption that there is no limit to the capability of privacy violators. In our considered problem, this assumption is satisfied when all servers can have access to user $j$ ’s reported data $\mathbf{d}^{\prime}_{i,j}$ , which may not be realistic. Hence, in our problem setting, one server (i.e., server $s_{i}$ ) obtains $\mathbf{d}^{\prime}_{i,j}$ while other servers can access only the classifiers trained with users’ reported data.

Remark 2.

The $(\epsilon_{p},\delta_{p})$ -DP guarantee is provided for users’ feature vectors. Thus, in Phase 2, the sensitive information in those vectors is not disclosed much to the servers with lower trust degrees. For the labels, they obtain extra $(\epsilon_{p},\delta_{p})$ -DP preservation in Phase 2. Since the privacy-preserving scheme in Phase 1 preserves $\epsilon$ -DP for the labels, the released information about them in Phase 2 provides stronger privacy guarantee under the joint effect of $\epsilon$ -DP in Phase 1 and $(\epsilon_{p},\delta_{p})$ -DP in Phase 2. We will investigate the joint privacy-preserving degree in the future.

V Performance Analysis

In this section, we analyze the performance of the classifiers trained by the proposed PDML framework. Note that three different uncertainties are introduced into the ADMM algorithm (the label randomization in Phase 1, and the objective function perturbation and primal variable perturbation in Phase 2), and these uncertainties are coupled together. The difficulty in analyzing the performance lies in decomposing the effects of the three uncertainties and quantifying the role of each. It is also challenging to mitigate the effect of these perturbations on the trained classifiers, especially the influence of users’ wrong labels.

Here, we first give the definition of generalization error as the metric on the performance of the trained classifiers. Then, we establish a modified version of the loss function $\ell(\cdot)$ , which simultaneously achieves uncertainty decomposition and mitigation of label obfuscation. We finally derive a theoretical bound for the difference between the generalization error of trained classifiers and that of the ideal optimal classifier.

V-A Performance Metric

To measure the quality of trained classifiers, we use generalization error for analysis, which describes the expected error of a classifier on future predictions [37]. Recall that users’ data samples are drawn from the unknown distribution $\mathcal{P}$ . The generalization error of a classifier $\mathbf{w}$ is defined as the expectation of $\mathbf{w}$ ’s loss function with respect to $\mathcal{P}$ as $\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{P}}\left[\ell(y,\mathbf{w}^{\mathrm{T}}\mathbf{x})\right]$ . Further, define the regularized generalization error by

[TABLE]

We denote the classifier minimizing $J_{\mathcal{P}}(\mathbf{w})$ as $\mathbf{w}^{\star}$ , i.e., $\mathbf{w}^{\star}:=\arg\min_{\mathbf{w}\in\mathcal{W}}J_{\mathcal{P}}(\mathbf{w})$ . We call $\mathbf{w}^{\star}$ the ideal optimal classifier.

Here, $J_{\mathcal{P}}(\mathbf{w}^{\star})$ is the reference regularized generalization error under the classifier class $\mathcal{W}$ and the used loss function $\ell(\cdot)$ . The trained classifier can be viewed as a good predictor if it achieves generalization error close to $J_{\mathcal{P}}(\mathbf{w}^{\star})$ . Thus, as the performance metric of the classifiers, we use the difference between the generalization error of trained classifiers and $J_{\mathcal{P}}(\mathbf{w}^{\star})$ . The difference is denoted as $\Delta J_{\mathcal{P}}(\mathbf{w})$ , that is, $\Delta J_{\mathcal{P}}(\mathbf{w}):=J_{\mathcal{P}}(\mathbf{w})-J_{\mathcal{P}}(\mathbf{w}^{\star})$ .

Furthermore, to measure the performance of the classifiers trained by different servers at multiple iterations, we introduce a comprehensive metric. First, considering that the classifiers $\mathbf{w}_{i}$ solved by server $s_{i}$ at different iterations may be different until the consensus constraint (4) is satisfied, we define a classifier $\overline{\mathbf{w}}_{i}(t)$ to aggregate $\mathbf{w}_{i}$ in the first $t$ rounds as $\overline{\mathbf{w}}_{i}(t):=\frac{1}{t}\sum_{k=1}^{t}\mathbf{w}_{i}(k)$ , where $\mathbf{w}_{i}(k)$ is the obtained classifier by solving (15). Moreover, due to the diversity of users’ reported data, the classifiers solved by different servers may also differ (especially in the initial iterations). For this reason, we will later study the accumulated difference among the $n$ servers, that is, $\sum_{i=1}^{n}\Delta J_{\mathcal{P}}(\overline{\mathbf{w}}_{i}(t))$ .

V-B Modified Loss Function in ADMM Algorithm

To mitigate the effect of label obfuscation executed in Phase 1, we make some modification to the loss function $\ell(\cdot)$ in Problem 1. We use the noisy labels and the corresponding PPD $\epsilon$ in Phase 1 to adjust the loss function $\ell(\cdot)$ in (5). (Note that other parts of Problem 1 are not affected by the noisy labels.) Define the modified loss function $\hat{\ell}(y^{\prime}_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon)$ by

[TABLE]

This function has the following properties.

Proposition 2.

(i) $\hat{\ell}(y^{\prime}_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon)$ is an unbiased estimate of $\ell(y_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j})$ as

[TABLE]

(ii) $\hat{\ell}(y^{\prime}_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon)$ is Lipschitz continuous with Lipschitz constant

[TABLE]

where $c_{2}$ is the bound of $\left|\frac{\partial\ell(\cdot)}{\partial\mathbf{w}_{i}}\right|$ given in Assumption 1.

The proof can be found in Appendix B.
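To make the unbiasedness property concrete, the sketch below uses the standard correction for symmetric label flipping, $\hat{\ell}(y^{\prime},s)=\frac{(1-p)\,\ell(y^{\prime},s)-p\,\ell(-y^{\prime},s)}{1-2p}$ with $p=1/(1+e^{\epsilon})$. This particular form is an assumption made for illustration and is not claimed to be identical to (19); the logistic base loss is likewise only an example.

```python
import math


def logistic_loss(y: int, score: float) -> float:
    """Illustrative base loss l(y, w^T x) = log(1 + exp(-y * score))."""
    return math.log1p(math.exp(-y * score))


def corrected_loss(y_noisy: int, score: float, epsilon: float) -> float:
    """Assumed correction: ((1 - p) * l(y', s) - p * l(-y', s)) / (1 - 2p), p = 1 / (1 + e^eps)."""
    p = 1.0 / (1.0 + math.exp(epsilon))
    return ((1.0 - p) * logistic_loss(y_noisy, score)
            - p * logistic_loss(-y_noisy, score)) / (1.0 - 2.0 * p)


def expected_corrected_loss(y_true: int, score: float, epsilon: float) -> float:
    """Exact expectation over y' (equal to y with probability 1 - p, and -y with probability p)."""
    p = 1.0 / (1.0 + math.exp(epsilon))
    return ((1.0 - p) * corrected_loss(y_true, score, epsilon)
            + p * corrected_loss(-y_true, score, epsilon))


gap = abs(expected_corrected_loss(+1, 0.7, 1.0) - logistic_loss(+1, 0.7))
```

Under this assumed form, `gap` is zero up to rounding for any label, score, and $\epsilon>0$. The triangle inequality also shows that the correction scales the Lipschitz constant of the base loss by at most $\frac{1}{1-2p}=\frac{e^{\epsilon}+1}{e^{\epsilon}-1}$, which grows as $\epsilon$ decreases.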

Now, we make server $s_{i}$ use $\hat{\ell}(y^{\prime}_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon)$ in (19) as the loss function. Thus, the objective function in (3) must be replaced with the one as follows:

[TABLE]

Similar to $J(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ in (1), we denote the objective function with the modified loss function as $\widehat{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}}):=\sum_{i=1}^{n}\widehat{J}_{i}(\mathbf{w}_{i})$. Then, the following lemma holds, whose proof can be found in Appendix C.

Lemma 1.

If the loss function $\ell(\cdot)$ and the regularizer $N(\cdot)$ satisfy Assumptions 1 and 2, respectively, then $\widehat{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ is $a\kappa$ -strongly convex.

To simplify the notation, let $\hat{\kappa}:=a\kappa$ . With the objective function $\widehat{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ , the whole optimization problem for finding a common classifier can be stated as follows:

Problem 2.

[TABLE]

Lemma 2.

Problem 2 has an optimal solution set $\{\widehat{\mathbf{w}}_{i}\}_{i\in\mathcal{S}}\subset\mathcal{W}$ such that $\widehat{\mathbf{w}}_{\mathrm{opt}}=\widehat{\mathbf{w}}_{i}=\widehat{\mathbf{w}}_{l},\forall i,l$ .

Lemma 2 can be proved directly from Lemma 1 in [35], whose condition is satisfied by Lemma 1.

We finally arrive at stating the optimization problem to be solved in this paper. To this end, for the modified objective function in (22), we define the perturbed version as in (14) by $\widetilde{J}_{i}(\mathbf{w}_{i}):=\widehat{J}_{i}(\mathbf{w}_{i})+\frac{1}{n}\boldsymbol{\eta}_{i}^{\mathrm{T}}\mathbf{w}_{i}$ . Then, the whole objective function becomes

[TABLE]

The problem for finding the classifier with randomized labels and perturbed objective functions is as follows:

Problem 3.

[TABLE]

For $\widetilde{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ , we have the following lemma showing its convexity properties.

Lemma 3.

$\widetilde{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ is $\hat{\kappa}$-strongly convex. If $N(\cdot)$ satisfies $\|\nabla^{2}N(\cdot)\|_{2}\leq\varrho$, then $\widetilde{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ has a $(nc_{3}+a\varrho)$-Lipschitz continuous gradient, where $c_{3}$ is the bound of $\frac{\partial^{2}\ell(\cdot)}{\partial\mathbf{w}^{2}}$ given in Assumption 1.

The proof can be found in Appendix D. For simplicity, we denote the Lipschitz constant of the gradient of $\widetilde{J}(\mathbf{w})$ as $\varrho_{\widetilde{J}}$, namely, $\varrho_{\widetilde{J}}:=nc_{3}+a\varrho$.

We now observe that Problem 3 associated with the objective function $\widetilde{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ has an optimal solution set $\{\widetilde{\mathbf{w}}_{i}\}_{i\in\mathcal{S}}\subset\mathcal{W}$ where

[TABLE]

In fact, this can be shown by an argument similar to Lemma 2, where Lemma 3 establishes the convexity of the objective function (as in Lemma 1).

V-C Generalization Error Analysis

In this subsection, we analyze the accumulated difference between the generalization error of trained classifiers and $J_{\mathcal{P}}(\mathbf{w}^{\star})$, i.e., $\sum_{i=1}^{n}\Delta J_{\mathcal{P}}(\overline{\mathbf{w}}_{i}(t))$. For the analysis, we use the technique from [38], which considers the problem of ADMM learning in the presence of erroneous updates. Our problem is more complicated because, besides the erroneous updates brought by the primal variable perturbation, there is also uncertainty in the training data and the objective functions. All these uncertainties are coupled together, which brings extra challenges for performance analysis.

We first decompose $\Delta J_{\mathcal{P}}(\overline{\mathbf{w}}_{i}(t))$ in terms of different uncertainties. To do so, we must introduce a new regularized generalization error associated with the modified loss function $\hat{\ell}(y^{\prime},\mathbf{w}^{\mathrm{T}}\mathbf{x},\epsilon)$ and the noisy data distribution $\mathcal{P}_{\epsilon}$ . Similar to (18), for a classifier $\mathbf{w}$ , it is defined by

[TABLE]

According to Proposition 2, $\hat{\ell}(y^{\prime},\mathbf{w}^{\mathrm{T}}\mathbf{x},\epsilon)$ is an unbiased estimate of $\ell(y,\mathbf{w}^{\mathrm{T}}\mathbf{x})$ . Thus, it is straightforward to obtain the following lemma, whose proof is omitted.

Lemma 4.

For a classifier $\mathbf{w}$ , we have $J_{\mathcal{P}_{\epsilon}}(\mathbf{w})=J_{\mathcal{P}}(\mathbf{w})$ .

Now, we can decompose $\Delta J_{\mathcal{P}}(\overline{\mathbf{w}}_{i}(t))$ as follows:

[TABLE]

We will analyze each term in the far right-hand side of (24). The term $\widetilde{\mathbf{w}}_{\mathrm{opt}}-\overline{\mathbf{w}}_{i}(t)$ describes the difference between the classifier $\overline{\mathbf{w}}_{i}(t)$ and the optimal solution $\widetilde{\mathbf{w}}_{\mathrm{opt}}$ to Problem 3. Before analyzing this difference, we first consider the deviation between the perturbed classifier $\widetilde{\mathbf{w}}_{i}(t)$ and $\widetilde{\mathbf{w}}_{\mathrm{opt}}$ , and a bound for it can be obtained by [38].

Here, we introduce some notations related to the bound. Let the compact forms of vectors be $\widetilde{\mathbf{w}}(t):=[\widetilde{\mathbf{w}}_{1}^{\mathrm{T}}(t)\cdots\widetilde{\mathbf{w}}_{n}^{\mathrm{T}}(t)]^{\mathrm{T}}$ , $\boldsymbol{\theta}(t):=[\boldsymbol{\theta}_{1}^{\mathrm{T}}(t)\cdots\boldsymbol{\theta}_{n}^{\mathrm{T}}(t)]^{\mathrm{T}}$ , and $\boldsymbol{\eta}:=[\boldsymbol{\eta}_{1}^{\mathrm{T}}\cdots\boldsymbol{\eta}_{n}^{\mathrm{T}}]^{\mathrm{T}}$ . Also, let $\widehat{\mathbf{w}}^{*}:=[I_{d}\cdots I_{d}]^{\mathrm{T}}\cdot\widehat{\mathbf{w}}_{\mathrm{opt}}$ , $\widetilde{\mathbf{w}}^{*}:=[I_{d}\cdots I_{d}]^{\mathrm{T}}\cdot\widetilde{\mathbf{w}}_{\mathrm{opt}}$ , and $\overline{L}:=\frac{1}{2}(L_{+}+L_{-})$ . An auxiliary sequence $\mathbf{r}(t)$ is defined as $\mathbf{r}(t):=\sum_{k=0}^{t}Q\widetilde{\mathbf{w}}(k)$ with $Q:=\bigl{(}\frac{L_{-}}{2}\bigr{)}^{\frac{1}{2}}$ [39]. $\mathbf{r}(t)$ has an optimal value $\mathbf{r}_{\mathrm{opt}}$ , which is the solution to the equation $Q\mathbf{r}_{\mathrm{opt}}+\frac{1}{2\beta}\nabla\widetilde{J}(\widetilde{\mathbf{w}}_{\mathrm{opt}})=0$ .

Further, we define some important parameters to be used in the next lemma. The first two parameters, $b\in(0,1)$ and $\lambda_{1}>1$ , are related to the underlying network topology $\mathcal{G}$ and will be used to establish convergence property of the perturbed ADMM algorithm. Let $\varphi:=\frac{\lambda_{1}-1}{\lambda_{1}}\frac{2\hat{\kappa}\sigma_{\min}^{2}(Q)\sigma_{\min}^{2}(L_{+})}{\varrho_{\widetilde{J}}^{2}\sigma_{\min}^{2}(L_{+})+2\hat{\kappa}\sigma_{\max}^{2}(L_{+})}$ , where $\sigma_{\max}(\cdot)$ and $\sigma_{\min}(\cdot)$ denote the maximum and minimum nonzero eigenvalues of a matrix, respectively. Also, we define $M_{1}$ and $M_{2}$ with constant $\lambda_{2}>1$ as

[TABLE]

Then, we have the following lemma from [38], which gives a bound for $\widetilde{\mathbf{w}}(t)-\widetilde{\mathbf{w}}^{*}$ .

Lemma 5.

Suppose that the conditions of Lemma 3 hold. If the parameters $b$ and $\lambda_{1}$ can be chosen such that

[TABLE]

Take $\beta$ in (17) as $\beta=\sqrt{\frac{\lambda_{1}\lambda_{3}(\lambda_{4}-1)\varrho_{\widetilde{J}}^{2}}{\lambda_{4}(\lambda_{1}-1)\sigma_{\max}^{2}(L_{+})\sigma_{\min}^{2}(Q)}}$ , where $\lambda_{4}:=1+\sqrt{\frac{\varrho_{\widetilde{J}}^{2}\sigma_{\min}^{2}(L_{+})+2\hat{\kappa}\sigma_{\max}^{2}(L_{+})}{\alpha\lambda_{3}\varrho_{\widetilde{J}}^{2}\sigma_{\min}^{2}(L_{+})}}$ with $0<\alpha<\min\{M_{1},M_{2}\}$ , and $\lambda_{3}:=1+\frac{2\hat{\kappa}\sigma_{\max}^{2}(L_{+})}{\varrho_{\widetilde{J}}^{2}\sigma_{\min}^{2}(L_{+})}$ . Then, it holds

[TABLE]

where $C:=\frac{(1+4\alpha)\sigma_{\max}^{2}(L_{+})}{(1-b)(1+\varphi-4\alpha)\sigma_{\min}^{2}(L_{+})}$ , and $H_{1}:=\left\|\mathbf{w}(0)-\widetilde{\mathbf{w}}^{*}\right\|_{2}^{2}+\frac{4}{(1+4\alpha)\sigma_{\max}^{2}(L_{+})}\left\|\mathbf{r}(0)-\mathbf{r}_{\mathrm{opt}}\right\|_{2}^{2}$ , $H_{2}:=\frac{b(\lambda_{2}-1)}{1-b}+\frac{\frac{4\varphi\lambda_{1}\sigma_{\max}^{2}(\overline{L})}{\sigma_{\min}^{2}(Q)}+\sigma_{\max}^{2}(L_{+})\left(\sqrt{\varphi}+\sqrt{\frac{2(\lambda_{1}-1)\sigma_{\min}^{2}(Q)}{\alpha\lambda_{1}\lambda_{3}\varrho_{\widetilde{J}}^{2}}}\right)^{2}}{(1-b)(1+\varphi)(1+\varphi-4\alpha)\sigma_{\min}^{2}(L_{+})}$ .

Lemma 5 implies that given a connected graph $\mathcal{G}$ and the objective function in Problem 3, if the parameters $b$ and $\lambda_{1}$ satisfy (25), then $C$ in (26) is guaranteed to be less than 1. In this case, the obtained classifiers will converge to the neighborhood of the optimal solution $\widetilde{\mathbf{w}}_{\mathrm{opt}}$ , where the radius of the neighborhood is $\lim_{t\rightarrow\infty}\sum_{k=1}^{t}C^{t-k}H_{2}\|\boldsymbol{\theta}(k)\|_{2}^{2}$ . The modified ADMM algorithm can achieve different radii depending on the added noises $\boldsymbol{\theta}(k)$ . Since many parameters are involved, to meet the condition (25) may not be straightforward. In order to make $C$ smaller to achieve better convergence rate, in addition to the parameters, one may change, for example, the graph $\mathcal{G}$ to make the value $\frac{\sigma_{\max}^{2}(L_{+})}{\sigma_{\min}^{2}(L_{+})}$ smaller.

Theorem 1 to be stated below gives the upper bound of the accumulated difference $\sum_{i=1}^{n}\Delta J_{\mathcal{P}}(\overline{\mathbf{w}}_{i}(t))$ in the sense of expectation. In the theorem, we employ the important concept of Rademacher complexity [40]. It is defined on the classifier class $\mathcal{W}$ and the collected data used for training, that is, $\mathrm{Rad}_{i}(\mathcal{W}):=\frac{1}{m_{i}}\mathbb{E}_{\nu_{j}}\left[\sup_{\mathbf{w}\in\mathcal{W}}\sum_{j=1}^{m_{i}}\nu_{j}\mathbf{w}^{\mathrm{T}}\mathbf{x}_{i,j}\right]$ , where $\nu_{1},\nu_{2},\ldots,\nu_{m_{i}}$ are independent random variables drawn from the Rademacher distribution, i.e., $\Pr(\nu_{j}=1)=\Pr(\nu_{j}=-1)=\frac{1}{2}$ for $j=1,2,\ldots,m_{i}$ . In addition, we use the notation $\|\mathbf{v}\|_{A}^{2}$ to denote the norm of a vector $\mathbf{v}$ with a positive definite matrix $A$ , i.e., $\|\mathbf{v}\|_{A}^{2}=\mathbf{v}^{\mathrm{T}}A\mathbf{v}$ .
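For intuition about the quantity $\mathrm{Rad}_{i}(\mathcal{W})$, the sketch below evaluates it exactly for a small sample by enumerating all sign vectors. It assumes, purely for illustration, that $\mathcal{W}$ is the Euclidean ball $\{\mathbf{w}:\|\mathbf{w}\|_{2}\leq B\}$, in which case the supremum equals $B\,\|\sum_{j}\nu_{j}\mathbf{x}_{i,j}\|_{2}$ by the Cauchy-Schwarz inequality. The function name and data are hypothetical.

```python
import itertools
import math


def rademacher_linear(xs, radius):
    """Exact Rad(W) for W = {w : ||w||_2 <= radius}, enumerating all 2^m sign vectors.

    For this class, sup_w sum_j nu_j w^T x_j = radius * ||sum_j nu_j x_j||_2.
    """
    m = len(xs)
    d = len(xs[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=m):
        s = [sum(nu * x[k] for nu, x in zip(signs, xs)) for k in range(d)]
        total += radius * math.sqrt(sum(v * v for v in s))
    # Average over the 2^m equally likely sign vectors, then divide by m.
    return total / (2 ** m) / m


xs = [(1.0, 0.0), (0.0, 1.0)]  # two orthonormal feature vectors
rad = rademacher_linear(xs, radius=1.0)
```

For the two orthonormal feature vectors above, every sign pattern gives $\|\sum_{j}\nu_{j}\mathbf{x}_{j}\|_{2}=\sqrt{2}$, so `rad` equals $\sqrt{2}/2$. Enumeration is only feasible for small $m$; a Monte Carlo average over random sign vectors would be the natural substitute for larger samples.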

Theorem 1.

Suppose that the conditions in Lemma 5 are satisfied and the decaying rate of noise variance is set as $\rho\in(0,C)$ . Then, for $\epsilon>0$ and $\delta\in(0,1)$ , the aggregated classifier $\overline{\mathbf{w}}_{i}(t)$ obtained by the privacy-aware ADMM scheme (15)-(17) satisfies with probability at least $1-\delta$

[TABLE]

where $H_{3}=\|\mathbf{r}(0)\|_{2}^{2}+\|\mathbf{w}(0)-\widetilde{\mathbf{w}}^{*}\|_{\frac{L_{+}}{2}}^{2}$ , and the parameters $C$ , $H_{1}$ , $H_{2}$ and $\beta$ are found in Lemma 5.

Proof.

In what follows, we evaluate the terms in the far right-hand side of (24) by dividing them into three groups. The first is the terms $J_{\mathcal{P}_{\epsilon}}(\overline{\mathbf{w}}_{i}(t))-\widehat{J}_{i}(\overline{\mathbf{w}}_{i}(t))+\widehat{J}_{i}({\mathbf{w}^{\star}})-J_{\mathcal{P}_{\epsilon}}(\mathbf{w}^{\star})$ . We can bound them from above as

[TABLE]

According to Theorem 26.5 in [40], with probability at least $1-\delta$ , we have

[TABLE]

where $\mathrm{Rad}_{i}(\hat{\ell}\circ\mathcal{W})$ is the Rademacher complexity of $\mathcal{W}$ with respect to $\hat{\ell}$ . Further, by the contraction lemma in [40],

[TABLE]

where we have used Proposition 2. Also, from (19), we derive

[TABLE]

where $c_{1}$ is the bound of the original loss function $\ell(\cdot)$ (Assumption 1). Then, it follows that

[TABLE]

The second group in (24) are the terms about $\widetilde{J}_{i}(\cdot)$ and $\widehat{J}_{i}(\cdot)$ . In their aggregated forms, by Lemma 2, it holds

[TABLE]

where we have used Jensen’s inequality given the strongly convex $\widetilde{J}(\cdot)$ . For the first two terms in (29), by Theorem 1 of [38], we have

[TABLE]

Take the expectation on both sides of (30) with respect to $\boldsymbol{\theta}(k)$ . Given $\mathbb{E}_{\{\boldsymbol{\theta}(k)\}}\left\{\|\boldsymbol{\theta}(k)\|_{2}^{2}\right\}=\sum_{i=1}^{n}dV_{i}^{2}\rho^{k-1}$ , we derive

[TABLE]

where we used $\mathbb{E}\left\{\boldsymbol{\theta}(k)\right\}=0$ and $\mathbb{E}_{\{\boldsymbol{\theta}(k)\}}\left\{\boldsymbol{\theta}(k-1)\boldsymbol{\theta}(k)\right\}=0$ . Thus, it follows that

[TABLE]

Then, for (30), we arrive at

[TABLE]

Next, we focus on the latter two terms in (29). Due to (23), we have $\widetilde{J}(\widetilde{\mathbf{w}}^{*})\leq\widetilde{J}(\widehat{\mathbf{w}}^{*})$ , which yields

[TABLE]

By Lemma 7 in [14], we obtain $\|\widetilde{\mathbf{w}}^{*}-\widehat{\mathbf{w}}^{*}\|\leq\frac{1}{n}\frac{\|\boldsymbol{\eta}\|}{\hat{\kappa}}$ . It follows

[TABLE]

where $R$ is the bound of noise $\boldsymbol{\eta}_{i}$ . Substituting (31) and (32) into (29), we derive

[TABLE]

The third group in (24) is the term $\boldsymbol{\eta}^{\mathrm{T}}(\widetilde{\mathbf{w}}^{*}-\overline{\mathbf{w}}(t))$ . We have

[TABLE]

Taking the expectation with respect to $\boldsymbol{\theta}(k)$ , we obtain

[TABLE]

By Lemma 5, we have

[TABLE]

Then, it follows that

[TABLE]

where we have used $0<\rho<C$ . Substituting (28), (33) and (34) into (24), we arrive at the bound in (27). ∎

Theorem 1 provides guidance for both users and servers on how to obtain a classification model with the desired performance. In particular, the effects of three uncertainties on the bound of $\sum_{i=1}^{n}\Delta J_{\mathcal{P}}(\overline{\mathbf{w}}_{i}(t))$ have been decomposed. Note that these effects are not simply superimposed but coupled together. Specifically, the terms in (27) related to the primal variable perturbation decrease with iterations at the rate of $O\left(\frac{1}{t}\right)$. This also implies that the whole framework achieves convergence in expectation at this rate.

Compared with [16] and [38], where bounds of $\frac{1}{t}\sum_{k=1}^{t}\widetilde{J}(\mathbf{w}(k))-\widetilde{J}(\widetilde{\mathbf{w}}^{*})$ are provided, we derive the difference between the generalization error of the aggregated classifier $\overline{\mathbf{w}}(t)$ and that of the ideal optimal classifier $\mathbf{w}^{\star}$ , which is moreover given in a closed form. The bound in (27) contains the effect of the unknown data distribution $\mathcal{P}$ while the bound of $\frac{1}{t}\sum_{k=1}^{t}\widetilde{J}(\mathbf{w}(k))-\widetilde{J}(\widetilde{\mathbf{w}}^{*})$ covers only the role of existing data. Although [15] also considers the generalization error of found classifiers, no closed form of the bound is given, and the obtained bound may not decrease with iterations since the reference classifier therein is not $\mathbf{w}^{\star}$ but a time-varying one. In the more centralized setting of [14], $\Delta J_{\mathcal{P}}(\mathbf{w})$ is analyzed for the derived classifier $\mathbf{w}$ , but there is no convergence issue since $\mathbf{w}$ is perturbed and published only once.

Moreover, different from the works [14, 15, 16] and [38], our analysis considers the effects of the classifier class $\mathcal{W}$ by Rademacher complexity. Such effects have been used in [40] in non-private centralized machine learning scenarios. Furthermore, in the privacy-aware (centralized or distributed) frameworks of [14, 15, 16] and the robust ADMM scheme for erroneous updates [38], there is only one type of noise perturbation, and the uncertainty in the training data is not considered.

V-D Comparisons and Discussions

Here, we compare the proposed framework with existing schemes from the perspective of privacy and performance, and discuss how each parameter contributes to the results.

First, we find that the bound in (27) is larger than those in [14, 15, 16] if we adopt the approach in this paper to conduct performance analysis on these works. This is expected since there are more perturbations in our setting. However, as we have discussed in Section IV-C, these existing frameworks do not meet the heterogeneous privacy requirements, and some of them cannot avoid accumulation of privacy losses, resulting in no protection at all. It should be emphasized that extra performance costs must be paid when the data contributors want to obtain a stronger privacy guarantee. These existing frameworks may be better than ours in terms of performance, but only on the premise that users accept the privacy preservation they provide. If users require heterogeneous privacy protection, our framework can be more suitable.

Further, compared with [14, 15, 16], [38] and [40], we provide a more systematic result on the performance analysis in Theorem 1, where most parameters related to useful measures of classifiers (also privacy preservation) are included. Servers and users can set these parameters as needed, and thus obtain classifiers which can appropriately balance the privacy and the performance. We will discuss the roles of these parameters after some further analysis on the theoretical result.

According to Lemma 5, the classifiers solved by different servers converge to $\widetilde{\mathbf{w}}_{\mathrm{opt}}$ in the sense of expectation. The performance of $\widetilde{\mathbf{w}}_{\mathrm{opt}}$ can be analyzed in a similar way as in Theorem 1. This is given in the following corollary.

Corollary 1.

For $\epsilon>0$ and $\delta\in(0,1)$ , with probability at least $1-\delta$ , we have

[TABLE]

For the sake of comparison, the next theorem provides a performance analysis when the privacy-preserving approach in Phase 2 is removed, and a corresponding result on the bound of $\Delta J_{\mathcal{P}}(\widehat{\mathbf{w}}_{\mathrm{opt}})$ is given in the subsequent corollary.

Theorem 2.

For $\epsilon>0$ and $\delta\in(0,1)$ , the aggregated classifier $\overline{\mathbf{w}}_{i}(t)$ obtained by the original ADMM scheme (9) and (10) satisfies with probability at least $1-\delta$

[TABLE]

Corollary 2.

For $\epsilon>0$ and $\delta\in(0,1)$ , with probability at least $1-\delta$ , we have

[TABLE]

It is observed that the bound in (36) is not in expectation since there is no noise perturbation during the ADMM iterations. It is interesting to note that the convergence rate of the unperturbed ADMM algorithm is also $O(\frac{1}{t})$ . This implies that the modified ADMM algorithm preserves the convergence speed of the general distributed ADMM scheme.

However, there exists a tradeoff between performance and privacy protection. Comparing (27) and (36), we find that the extra terms in (27) are the results of perturbations in Phase 2. Also, the effect of the objective function perturbation is reflected in (35), that is, the term $\frac{1}{n\hat{\kappa}}R^{2}$ . When $R$ (the bound of $\boldsymbol{\eta}_{i}$ ) increases, the generalization error of the trained classifier would increase as well, indicating worse performance. Similarly, if we use noises with larger initial variances and decaying rates to perturb the solved classifiers in each iteration, the bound in (27) will also increase.

Effect of data quality. We observe that the bound of $\Delta J_{\mathcal{P}}(\widehat{\mathbf{w}}_{\mathrm{opt}})$ in (37) also appears in (27), (35) and (36). This bound reflects the effect of users’ reported data, whose labels are randomized in Phase 1. It can be seen that besides the probability $\delta$ , the bound in (37) is affected by three factors: PPD $\epsilon$ , Rademacher complexity $\mathrm{Rad}_{i}(\mathcal{W})$ , and the number of data samples $m_{i}$ . Here, we discuss the roles of these factors.

For the effect of PPD, we find that when $\epsilon$ is small, the bound will decrease with an increase in $\epsilon$ . However, when $\epsilon$ is sufficiently large, it has limited influence on the bound. In particular, by taking $\epsilon\rightarrow\infty$ , the bound reduces to that for the optimal solution of Problem 1, where $(e^{\epsilon}+1)/(e^{\epsilon}-1)$ goes to 1 in (37). Note that $\mathrm{Rad}_{i}(\mathcal{W})$ and $m_{i}$ still remain and affect the performance.
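The saturation described above can be checked numerically. The short sketch below evaluates the factor $(e^{\epsilon}+1)/(e^{\epsilon}-1)$ from (37), written equivalently as $\coth(\epsilon/2)$ to avoid overflow for large $\epsilon$. The function name and the chosen values of $\epsilon$ are illustrative assumptions.

```python
import math


def ppd_factor(eps):
    """Return (e^eps + 1) / (e^eps - 1) = coth(eps / 2) for eps > 0."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return 1.0 / math.tanh(eps / 2.0)


# The factor blows up as eps -> 0 and flattens towards 1 as eps grows.
for eps in (0.1, 0.4, 0.6, 1.0, 2.0, 5.0):
    print(f"eps={eps:4.1f}  factor={ppd_factor(eps):8.3f}")
```

The factor is strictly decreasing in $\epsilon$, steep near zero and nearly flat for large $\epsilon$, which matches the qualitative behavior discussed in the text.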

For the effect of $\mathrm{Rad}_{i}(\mathcal{W})$, we observe that the generalization errors of trained classifiers may become larger when $\mathrm{Rad}_{i}(\mathcal{W})$ increases. The Rademacher complexity is directly related to the size of the classifier class $\mathcal{W}$. If there are only a small number of candidate classifiers in $\mathcal{W}$, the solutions have a high probability of achieving a smaller deviation between their generalization errors and the reference generalization error $J_{\mathcal{P}}(\mathbf{w}^{\star})$. Nevertheless, the class $\mathcal{W}$ must be rich enough to make $J_{\mathcal{P}}(\mathbf{w}^{\star})$ itself small, since $\mathbf{w}^{\star}$ selected from an overly restricted $\mathcal{W}$ will have a large generalization error. In that case, although the deviation $\Delta J_{\mathcal{P}}(\cdot)$ may be small, the trained classifiers are not good predictors due to the poor performance of $\mathbf{w}^{\star}$. Thus, choosing an appropriate classifier class is important for obtaining a classifier with adequate performance.

Finally, we consider the effect of the number of users. From the bound of $\Delta J_{\mathcal{P}}(\widehat{\mathbf{w}}_{\mathrm{opt}})$ in (37), we know that if $m_{i}$ becomes larger, the last term of the bound will decrease. In general, more data samples imply access to more information about the underlying distribution $\mathcal{P}$. Then, the trained classifier can predict the labels of newly sampled data from $\mathcal{P}$ with higher accuracy. Moreover, it can be seen that the bound is the average of $n$ local errors generated in different servers. When new servers participate in the DML framework, these servers should make sure that they have collected a sufficient number of training data samples. Otherwise, the bound may not decrease even though the total number of data samples increases. This is because unbalanced local errors may lead to an increase in their average, implying a larger bound on $\Delta J_{\mathcal{P}}(\cdot)$.

VI Experimental Evaluation

In this section, we conduct experiments to validate the obtained theoretical results and study the classification performance of the proposed PDML framework. Specifically, we first use a real-world dataset to verify the convergence property of the PDML framework and study how key parameters would affect the performance. Also, we leverage another seven datasets to verify the classification accuracy of the classifiers trained by the framework.

VI-A Experiment Setup

VI-A1 Datasets

We use two kinds of publicly available datasets as described below to validate the convergence property and classification accuracy of the PDML.

(i) Adult dataset [41]. The dataset contains census data of 48,842 individuals, where there are 14 attributes (e.g., age, work-class, education, occupation and native-country) and a label indicating whether a person’s annual income is over \$50,000. After removing the instances with missing values, we obtain a training dataset with 45,222 samples. To preprocess the dataset, we adopt a unary encoding approach to transform the categorical attributes into binary vectors, and further normalize the whole feature vector to have a maximum norm of 1. The preprocessed feature vector is 105-dimensional. For the labels, we mark an annual income over \$50,000 as 1; otherwise, the label is $-1$.
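The preprocessing steps can be sketched as follows. This is a minimal illustration on three toy records, not the authors' pipeline: the category list, the scaling of the numeric attribute, and the reading of "maximum norm of 1" as dividing every vector by the largest Euclidean norm in the dataset are all assumptions made here.

```python
import math


def one_hot(value, categories):
    """Unary (one-hot) encoding of a categorical attribute."""
    return [1.0 if value == c else 0.0 for c in categories]


def scale_to_unit_ball(vectors):
    """Divide every vector by the largest Euclidean norm, so max ||x||_2 = 1."""
    max_norm = max(math.sqrt(sum(v * v for v in x)) for x in vectors)
    if max_norm == 0.0:
        return [list(x) for x in vectors]
    return [[v / max_norm for v in x] for x in vectors]


def encode_label(income_over_50k):
    """Map the income indicator to the labels 1 and -1 used in the paper."""
    return 1 if income_over_50k else -1


# Hypothetical toy records: (age, work-class, income over 50,000).
WORKCLASS = ["Private", "Self-emp", "Gov"]
records = [(39, "Gov", False), (50, "Self-emp", True), (28, "Private", False)]

raw = [[age / 100.0] + one_hot(wc, WORKCLASS) for age, wc, _ in records]
features = scale_to_unit_ball(raw)
labels = [encode_label(flag) for _, _, flag in records]
```

Bounding the feature norm by 1 matters for the analysis because it bounds $|\mathbf{w}^{\mathrm{T}}\mathbf{x}_{i,j}|$ whenever the classifier class is norm-bounded.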

(ii) Gunnar Rätsch’s benchmark datasets [42]. There are thirteen data subsets from the UCI repository in the benchmark datasets. To mitigate the effect of data quality, we select seven datasets with the largest data sizes to conduct experiments. The seven datasets are German, Image, Ringnorm, Banana, Splice, Twonorm and Waveform, where the numbers of instances are 1,000, 2,086, 7,400, 5,300, 2,991, 7,400 and 5,000, respectively. Each dataset is partitioned into training and test data, with a ratio of approximately $70\%:30\%$ .

VI-A2 Underlying classification approach

Logistic regression (LR) is utilized for training the prediction model, where the loss function and regularizer are $\ell_{LR}(y_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j})=\log\bigl{(}1+e^{-y_{i,j}\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j}}\bigr{)}$ and $N(\mathbf{w}_{i})=\frac{1}{2}\|\mathbf{w}_{i}\|^{2}$ , respectively. Then, the local objective function is given by

[TABLE]

It is easy to check that when the classifier class $\mathcal{W}$ is bounded (e.g., a bounded set $\mathcal{W}=\{\mathbf{w}\in\mathbb{R}^{d}\;|\;\|\mathbf{w}\|\leq W\}$), $\ell_{LR}(\cdot)$ satisfies Assumption 1. Due to the strong convexity of $N(\mathbf{w}_{i})$, $J_{i}(\mathbf{w}_{i})$ is strongly convex. Then, according to Lemma 2, Problems 2 and 3 have optimal solution sets, and thus, we can use LR to train the classifiers.
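The boundedness claim can be illustrated with a small sketch. Assuming $\|\mathbf{x}_{i,j}\|\leq 1$ (as after the normalization in Section VI-A1) and $\|\mathbf{w}\|\leq W$, we have $|\mathbf{w}^{\mathrm{T}}\mathbf{x}_{i,j}|\leq W$, so the logistic loss is at most $\log(1+e^{W})$ and its derivative with respect to $\mathbf{w}^{\mathrm{T}}\mathbf{x}_{i,j}$ has magnitude at most 1. The function names below are illustrative and not taken from the paper.

```python
import math


def lr_loss(y, z):
    """Logistic loss log(1 + exp(-y * z)) with z = w^T x, computed stably."""
    t = -y * z
    # log(1 + e^t) = max(t, 0) + log(1 + e^{-|t|}) avoids overflow.
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def lr_loss_derivative(y, z):
    """Derivative of the loss with respect to z: -y / (1 + exp(y * z))."""
    # 1 / (1 + e^u) = 0.5 * (1 - tanh(u / 2)) is stable for large |u|.
    return -y * 0.5 * (1.0 - math.tanh(y * z / 2.0))


def lr_loss_bound(radius):
    """Upper bound log(1 + e^W) on the loss when |z| <= W = radius."""
    return lr_loss(1, -radius)


def regularizer(w):
    """N(w) = 0.5 * ||w||_2^2."""
    return 0.5 * sum(v * v for v in w)
```

The bound grows roughly linearly in $W$ for large $W$, so a larger classifier class loosens the constant $c_{1}$ that enters the generalization bound.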

VI-A3 Network topology

We consider $n=10$ servers that collaboratively train a prediction model. A connected random graph describes the communication topology of the 10 servers; it has $E=13$ communication links in total. Each server is responsible for collecting the data from a group of users, and thus there are 10 groups of users. In the experiments, we assume that each group has the same number of users, that is, $m_{i}=m_{l},\forall i,l$. For example, when we use $m=45,000$ instances sampled from the Adult dataset to train the classifier, each server collects data from $m_{i}=4,500$ users.
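One way to obtain such a topology is sketched below. The paper does not state how its random graph was generated, so the construction here (a random spanning tree plus randomly chosen extra links) and all function names are assumptions for illustration only.

```python
import random


def random_connected_graph(n, num_edges, seed=0):
    """Return a sorted list of undirected edges (u, v), u < v, forming a connected graph."""
    if not n - 1 <= num_edges <= n * (n - 1) // 2:
        raise ValueError("num_edges must lie between n-1 and n(n-1)/2")
    rng = random.Random(seed)
    nodes = list(range(n))
    rng.shuffle(nodes)
    edges = set()
    # Random spanning tree: attach each node to a randomly chosen earlier node.
    for idx in range(1, n):
        u, v = nodes[idx], nodes[rng.randrange(idx)]
        edges.add((min(u, v), max(u, v)))
    # Add the remaining links uniformly at random among unused pairs.
    unused = [(u, v) for u in range(n) for v in range(u + 1, n)
              if (u, v) not in edges]
    rng.shuffle(unused)
    edges.update(unused[: num_edges - len(edges)])
    return sorted(edges)


def is_connected(n, edges):
    """Depth-first search from node 0 over the undirected edge list."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == n


edges = random_connected_graph(n=10, num_edges=13)
```

Starting from a spanning tree guarantees connectivity by construction, which the analysis requires of $\mathcal{G}$.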

VI-B Experimental Results with Adult Dataset

Based on the Adult dataset, we first verify the convergence property of the PDML framework. Fig. 3(a) illustrates the maximum distance between the norms of any two classifiers found by different servers. We set the bound of $\boldsymbol{\eta}_{i}$ to 1. Other settings are the same as those in the experiments with the synthetic dataset. For the sake of comparison, we also draw the variation curve (with circle markers) of the maximum distance when the privacy-preserving approach in Phase 2 is removed. We observe that both distances converge to 0, implying that the consensus constraint is eventually satisfied.
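The quantity plotted in Fig. 3(a) can be computed as in the following minimal sketch, where the function name is an illustrative assumption. Since the metric compares norms, the largest pairwise gap equals the largest norm minus the smallest norm.

```python
import math


def max_norm_gap(classifiers):
    """Largest | ||w_i||_2 - ||w_l||_2 | over all pairs of servers i, l."""
    norms = [math.sqrt(sum(v * v for v in w)) for w in classifiers]
    return max(norms) - min(norms)
```

Note that a zero gap in norms is necessary for consensus but does not by itself prove that the classifiers coincide; a stricter check would compare the vectors directly.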

Fig. 3(b) shows the variation of empirical risks (the objective function in (1)) as iterations proceed. Here, the green dashed line depicts the final empirical risk achieved by general ADMM with original data, which we call the reference empirical risk. There are also two curves showing varying empirical risks with privacy preservation. Comparing the two curves, we find that the ADMM with the combined noise-adding scheme preserves the convergence property of the general ADMM algorithm. Due to the noise perturbations in Phase 2, the convergence time becomes longer. In addition, regardless of whether the privacy-preserving approach in Phase 2 is used, neither ADMM scheme achieves the final empirical risk indicated by the green line, which is consistent with the analysis in Section V-D.

We then study the effects of the key parameters on the performance. In Fig. 4(a), we examine the impact of the noise bound $R$ when the decaying rate $\rho$ is fixed at $0.8$. It is observed that $R$ affects the final empirical risks of the trained classifiers. The larger the noise bound, the greater the gap between the achieved empirical risks and the reference value, which is consistent with Corollary 1. In Fig. 4(b), we inspect the effect of the Gaussian noise decaying rate $\rho$ when $R$ is fixed at $1$. We find that the convergence time is affected by $\rho$. A larger $\rho$ implies that the communicated classifiers are still perturbed by noises with larger variance even after many iterations. Thus, more iterations are needed to reach the same final empirical risk as with a smaller $\rho$. Such a property can be derived from the bound in (27).

Fig. 4(c) illustrates the variation of final empirical risks when the PPD $\epsilon$ changes. The final empirical risks decrease with larger PPD (weaker privacy guarantee), which implies the tradeoff relation between the privacy protection and the performance. Further, the extra perturbations in Phase 2 lead to larger empirical risks for all the PPDs in the experiments. We also find that when $\epsilon$ is large ( $\epsilon>0.6$ ), the achieved empirical risks are close to the reference value, and do not significantly change. Again, the result is consistent with the analysis of the bound in (37).

VI-C Classification Accuracy Evaluation

We use the test data of the seven datasets to evaluate the prediction performance of the trained classifiers, which is shown in Table I. The classification accuracy is defined as the fraction of test samples whose labels predicted by the trained classifier match the true labels. For comparison, we present the classification accuracy achieved by general ADMM with the original data. To validate the classification accuracy under the PDML framework, we choose six different parameter configurations to conduct the experiments. The specific configurations can be found in the second row of Table I. We find that larger $\epsilon$ and smaller $R$ yield better accuracy. According to the theoretical results, the upper bounds for the differences $\Delta J_{\mathcal{P}}(\widetilde{\mathbf{w}}_{\mathrm{opt}})$ and $\Delta J_{\mathcal{P}}(\widehat{\mathbf{w}}_{\mathrm{opt}})$ decrease with larger $\epsilon$ and smaller $R$, implying better performance of the trained classifiers. Thus, the bound in Theorem 1 also provides a guideline for choosing appropriate parameters to obtain a prediction model with satisfactory classification accuracy.
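The accuracy metric can be sketched as below. The prediction rule $\mathrm{sign}(\mathbf{w}^{\mathrm{T}}\mathbf{x})$ is the usual choice for logistic regression with labels in $\{1,-1\}$ and is assumed here rather than quoted from the paper; assigning a zero score to the label 1 and the function names are likewise assumptions.

```python
def predict(w, x):
    """Predict a label in {1, -1} from the sign of w^T x (ties go to 1)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1


def accuracy(w, xs, ys):
    """Percentage of test samples whose predicted label equals the true label."""
    if not xs:
        raise ValueError("empty test set")
    correct = sum(1 for x, y in zip(xs, ys) if predict(w, x) == y)
    return 100.0 * correct / len(xs)
```

Because the labels are binary, a value near 50 on a roughly balanced test set indicates a classifier that is no better than random guessing.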

It is notable that even under the strongest privacy setting ($\epsilon=0.4$, $R=9$), the proposed framework achieves classification accuracy comparable to the reference accuracy. We also notice that on the datasets Banana and Splice, PDML achieves inferior accuracy in all settings. For a binary classification problem, an accuracy of around $50\%$ is no better than random guessing. The reason for the poor accuracy may be that LR is not a suitable classification approach for these two datasets. Overall, the proposed PDML framework achieves competitive classification accuracy while providing strong privacy protection.

VII Conclusion

In this paper, we have provided a privacy-preserving ADMM-based distributed machine learning framework. By a local randomization approach, data contributors obtain self-controlled DP protection for the most sensitive labels and the privacy guarantee will not decrease as ADMM iterations proceed. Further, a combined noise-adding method has been designed for perturbing the ADMM algorithm, which simultaneously preserves privacy for users’ feature vectors and strengthens protection for the labels. Lastly, the performance of the proposed PDML framework has been analyzed in theory and validated by extensive experiments.

For future investigations, we will study the joint privacy-preserving effects of the local randomization approach and the combined noise-adding method. Moreover, it is interesting yet challenging to extend the PDML framework to non-empirical risk minimization problems. When users assign distinct sensitivity levels to different attributes, we are interested in designing a new privacy-aware scheme that provides heterogeneous privacy protection for different attributes.

-A Proof of Proposition 1

Let $\mathbf{d}^{\prime}=(\mathbf{x},y^{\prime})$ be the reported data of a user with arbitrary data sample $\mathbf{d}=(\mathbf{x},y)$ drawn from $\mathcal{P}$ . Then we have $\mathbf{d}^{\prime}=M(\mathbf{d})$ . Suppose that the user’s data sample has label $y_{1}=1$ , which is denoted by $\mathbf{d}_{1}=(\mathbf{x},1)$ . By (12) and (13), the probability that the user reports $y_{1}$ to the server is

[TABLE]

Similarly, if the user’s original label is $y_{2}=-1$ , i.e., $\mathbf{d}_{2}=(\mathbf{x},-1)$ , we have

[TABLE]

Then, we further have the relations as follows:

[TABLE]

With a slight abuse of notation, we view label “ $-1$ ” as “0” below. Note that under this case, the observation set $\mathcal{O}$ in Definition 1 is the user’s reported data $\mathbf{d}^{\prime}$ . Then, for any $\mathbf{d}^{\prime}$ with feature vector $\mathbf{x}$ and arbitrary label $y^{\prime}$ , we have

[TABLE]

where we use the relation $p\in(0,\frac{1}{2})$ . ∎
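The local randomization analyzed in this proof can be illustrated with a randomized-response sketch. Equations (12) and (13) are not reproduced in this part of the text, so the code assumes, loudly, that the mechanism amounts to reporting the opposite label with probability $p\in(0,\frac{1}{2})$ and that the privacy level is $\epsilon=\ln\frac{1-p}{p}$, i.e., $p=1/(1+e^{\epsilon})$. The function names are illustrative.

```python
import math
import random


def flip_probability(eps):
    """Assumed flip probability p = 1 / (1 + e^eps), which lies in (0, 1/2) for eps > 0."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    # 1 / (1 + e^eps) = 0.5 * (1 - tanh(eps / 2)), stable for large eps.
    return 0.5 * (1.0 - math.tanh(eps / 2.0))


def privacy_level(p):
    """Assumed privacy level eps = ln((1 - p) / p) for p in (0, 1/2)."""
    if not 0.0 < p < 0.5:
        raise ValueError("p must lie in (0, 1/2)")
    return math.log((1.0 - p) / p)


def randomize_label(y, eps, rng):
    """Report -y with probability p and y otherwise; the features are left untouched."""
    return -y if rng.random() < flip_probability(eps) else y


rng = random.Random(0)
reported = [randomize_label(1, 0.4, rng) for _ in range(10)]
```

Under this assumption $1/(1-2p)=(e^{\epsilon}+1)/(e^{\epsilon}-1)$, which is the same $\epsilon$-dependent factor that appears in the bound (37).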

-B Proof of Proposition 2

(i) According to (12), we have

[TABLE]

where we have used Proposition 1. Then, it follows that

[TABLE]

By (19), we obtain

[TABLE]

Substituting (39) into (38), we arrive at

[TABLE]

(ii) The derivative of $\hat{\ell}(y^{\prime}_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon)$ with respect to $\mathbf{w}_{i}$ is given by

[TABLE]

Then, we have

[TABLE]

This bound gives the Lipschitz constant of $\hat{\ell}(y^{\prime}_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon)$ . ∎

-C Proof of Lemma 1

According to Assumption 2, $N(\cdot)$ is twice differentiable. By Taylor’s Theorem, we have

[TABLE]

where $\nabla N(\cdot)$ and $\nabla^{2}N(\cdot)$ denote the gradient and the Hessian, respectively. Due to (2), we derive

[TABLE]

which implies $\nabla^{2}N(\mathbf{w})\geq\kappa$. For all $i$, let $f(\mathbf{w}_{i}):=\widehat{J}_{i}(\mathbf{w}_{i})-\frac{a\kappa}{2n}\|\mathbf{w}_{i}\|_{2}^{2}$, and then we have

[TABLE]

where we have used Assumption 1. This relation also implies that $f(\mathbf{w}_{i})$ is convex. Then, we obtain $\forall\mathbf{w}_{1},\mathbf{w}_{2}\in\mathcal{W}$ , $f(\mathbf{w}_{1})\geq f(\mathbf{w}_{2})+\nabla f(\mathbf{w}_{2})^{\mathrm{T}}(\mathbf{w}_{1}-\mathbf{w}_{2})$ . It follows that

[TABLE]

Rearranging the above relation yields

[TABLE]

which indicates $\widehat{J}_{i}(\mathbf{w}_{i})$ is $\frac{a\kappa}{n}$ -strongly convex. Since $\widehat{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})=\sum_{i=1}^{n}\widehat{J}_{i}(\mathbf{w}_{i})$ , it follows that $\widehat{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ is $a\kappa$ -strongly convex. ∎

-D Proof of Lemma 3

The strong convexity of $\widetilde{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$ follows directly from Lemma 1. For the Lipschitz continuous gradient, we consider the compact form of the classifiers, $\mathbf{w}=[\mathbf{w}_{1}^{\mathrm{T}}\cdots\mathbf{w}_{n}^{\mathrm{T}}]^{\mathrm{T}}$, so that $\widetilde{J}(\mathbf{w})=\widetilde{J}(\{\mathbf{w}_{i}\}_{i\in\mathcal{S}})$. The second derivative of $\widetilde{J}(\mathbf{w})$ with respect to $\mathbf{w}$ is given by

[TABLE]

For $\frac{\partial^{2}\hat{\ell}(y^{\prime}_{i,j},\mathbf{w}_{i}^{\mathrm{T}}\mathbf{x}_{i,j},\epsilon)}{\partial\mathbf{w}_{i}^{2}}$ , we have

[TABLE]

Due to $\|\nabla^{2}N(\cdot)\|_{2}\leq\varrho$ , we derive

[TABLE]

This also gives the Lipschitz continuous gradient of $\widetilde{J}(\mathbf{w})$ . ∎

Bibliography


[1] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart, “Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing,” in Proc. USENIX Secur., 2014, pp. 17–32.
[2] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proc. IMLS ICML, 2014, pp. 1188–1196.
[3] H. Wang, N. Wang, and D.-Y. Yeung, “Collaborative deep learning for recommender systems,” in Proc. ACM SIGKDD, 2015, pp. 1235–1244.
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[5] Y. Qi, O. Tastan, J. G. Carbonell, J. Klein-Seetharaman, and J. Weston, “Semi-supervised multi-task learning for predicting interactions between HIV-1 and human proteins,” Bioinformatics, vol. 26, no. 18, pp. 645–652, 2010.
[6] C. Dwork, “Differential privacy: A survey of results,” in Proc. Int. Conf. Theor. Appl. Mod. Comput., 2008, pp. 1–19.
[7] X. Wang, J. He, P. Cheng, and J. Chen, “Privacy preserving collaborative computing: Heterogeneous privacy guarantee and efficient incentive mechanism,” IEEE Trans. Signal Proces., vol. 67, no. 1, pp. 221–233, 2019.
[8] E. Nozari, P. Tallapragada, and J. Cortés, “Differentially private average consensus: Obstructions, trade-offs, and optimal algorithm design,” Automatica, vol. 81, pp. 221–231, 2017.
