Disparate Vulnerability to Membership Inference Attacks

Bogdan Kulynych; Mohammad Yaghini; Giovanni Cherubin; Michael Veale; Carmela Troncoso

arXiv:1906.00389·cs.LG·September 20, 2021


TL;DR

This paper investigates the unequal susceptibility of different population groups to membership inference attacks, establishing conditions for prevention, analyzing connections to fairness and privacy, and providing a framework for reliable assessment.

Contribution

It offers a theoretical framework for understanding and measuring disparate vulnerability to MIAs, linking it to fairness and privacy, and presents experimental evidence of such disparities.

Findings

01

Disparate vulnerability exists in realistic settings.

02

Fairness alone cannot prevent all disparities.

03

Differential privacy reduces but does not eliminate vulnerability.


Tables (6)

Table 1: Subgroup representation in the datasets.

Dataset	Subgroup $z$	Size
adult	“White” (WH)	38,903
	“Black” (BL)	4,228
	“Asian-Pac-Islander” (AI)	1,303
	“Amer-Indian-Eskimo” (AE)	435
	“Other” (OT)	353
	All	48,842
texas-50K	4	31,514
	5	10,883
	3	6,451
	2	1,019
	1	133
	All	50,000

Table 2: Summary of model performance and vulnerability on adult and texas-50K. Columns: Disparity test: $p$-value of the ANOVA F-test that checks whether any of the subgroups have differing subgroup vulnerabilities; Test acc.: test accuracy of the models; Gen. gap: per-model difference between train accuracy and test accuracy; Vuln.: aggregate vulnerability $V(\mathcal{A})$. Bold font indicates models that have statistically significant disparity ($p<0.01$).

Model (adult)	Disparity test $p$	Test acc. avg	Test acc. std	Gen. gap avg	Gen. gap std	Vuln. (%) avg	Vuln. (%) std
Logistic Regression (LR)	0.3230	0.8404	0.0018	0.0012	0.0034	0.0942	0.4093
8-Neuron NN	0.0000	0.8421	0.0018	0.0044	0.0033	0.4052	0.3927
32-Neuron NN	0.0000	0.8410	0.0019	0.0131	0.0033	1.1373	0.4178
DP LR, $\varepsilon = 1$	0.8534	0.7797	0.0135	0.0006	0.0040	0.0830	0.3478
DP LR, $\varepsilon = 2$	0.0500	0.8053	0.0076	0.0004	0.0036	0.0563	0.3360
DP LR, $\varepsilon = 10$	0.0419	0.8321	0.0023	0.0011	0.0032	0.0888	0.4100
Fair LR (Dem. Parity)	0.8945	0.8267	0.0018	0.0011	0.0035	0.0980	0.3331
Fair LR (Equalized Odds)	0.7089	0.7941	0.0095	0.0006	0.0038	0.0782	0.3521

Table 3: Results of post-hoc tests on adult models. Columns: $z$ and $z^{\prime}$: identifiers of subgroups; $t$: value of the t statistic; $p$: uncorrected p-value; $p$-corr.: p-value after the correction for multiple comparisons.

Model	$z$	$z^{\prime}$	$t$	$p$	$p$-corr.
8-Neuron NN	AE	AI	-4.4298	0.0000	0.0001
	AE	BL	0.5143	0.6076	0.6751
	AE	OT	-1.7468	0.0822	0.1174
	AE	WH	0.0498	0.9604	0.9604
	AI	BL	8.8677	0.0000	0.0000
	AI	OT	1.8976	0.0592	0.0987
	AI	WH	8.9236	0.0000	0.0000
	BL	OT	-2.6402	0.0089	0.0224
	BL	WH	-1.3443	0.1804	0.2255
	OT	WH	2.3290	0.0209	0.0417
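The caption does not say which multiple-comparison correction was applied. The $p$-corr. column appears consistent with the Benjamini–Hochberg procedure (for example, $0.6076 \times 10/9 \approx 0.6751$), so the sketch below assumes that procedure and recomputes the corrected values from the rounded uncorrected $p$-values of the 8-Neuron NN rows. The helper name is hypothetical, and small deviations in the last digit are expected because the inputs are rounded.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvalues)
    # Indices that sort the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Uncorrected p-values from Table 3 (8-Neuron NN on adult), in table order.
pairs = ["AE-AI", "AE-BL", "AE-OT", "AE-WH", "AI-BL",
         "AI-OT", "AI-WH", "BL-OT", "BL-WH", "OT-WH"]
p = [0.0000, 0.6076, 0.0822, 0.9604, 0.0000,
     0.0592, 0.0000, 0.0089, 0.1804, 0.0209]
for pair, q in zip(pairs, benjamini_hochberg(p)):
    print(f"{pair}: {q:.4f}")
```

The same function applied to the uncorrected column of Table 4 can be used to check whether that table was corrected the same way.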

Table 4: Results of post-hoc tests on texas-50K models. See Table 3 caption for details.

Model	$z$	$z^{\prime}$	$t$	$p$	$p$-corr.
32-Neuron NN	1	2	-3.4973	0.0006	0.0007
	1	3	0.2056	0.8374	0.8374
	1	4	4.2820	0.0000	0.0000
	1	5	3.0576	0.0025	0.0028
	2	3	10.0174	0.0000	0.0000
	2	4	21.2727	0.0000	0.0000
	2	5	17.4069	0.0000	0.0000
	3	4	21.8804	0.0000	0.0000
	3	5	13.2434	0.0000	0.0000
	4	5	-8.1600	0.0000	0.0000

Table 5: Results on adult, disaggregated by subgroups, for models with disparity F-test $p<0.01$.

Model	Subgroup $z$	Test acc. avg	Test acc. std	Gen. gap avg	Gen. gap std	Subgroup vuln. avg	Subgroup vuln. std
32-Neuron NN	Amer-Indian-Eskimo	0.9028	0.0139	0.0115	0.0253	1.1701	4.8259
	Asian-Pac-Islander	0.8165	0.0119	0.0693	0.0195	5.7713	2.6300
	Black	0.9043	0.0049	0.0138	0.0086	0.8200	1.6261
	Other	0.8881	0.0179	0.0492	0.0295	3.2550	5.1807
	White	0.8338	0.0021	0.0109	0.0035	0.9773	0.4496
8-Neuron NN	Amer-Indian-Eskimo	0.9042	0.0151	0.0041	0.0281	0.3701	4.7177
	Asian-Pac-Islander	0.8264	0.0119	0.0223	0.0214	2.1320	2.7965
	Black	0.9066	0.0047	0.0035	0.0093	0.1878	1.6152
	Other	0.8913	0.0165	0.0149	0.0309	1.2805	5.6344
	White	0.8345	0.0020	0.0039	0.0036	0.3535	0.4314

Table 6: Results on texas-50K, disaggregated by subgroups, for models with disparity F-test $p<0.01$.

Model	Subgroup $z$	Test acc. avg	Test acc. std	Gen. gap avg	Gen. gap std	Subgroup vuln. avg	Subgroup vuln. std
32-Neuron NN	1	0.8699	0.0380	0.0791	0.0451	8.5188	8.2829
	2	0.8644	0.0153	0.1013	0.0180	10.7429	3.0129
	3	0.8498	0.0085	0.0855	0.0106	8.3947	1.6121
	4	0.8644	0.0066	0.0637	0.0063	6.0331	0.8261
	5	0.8708	0.0063	0.0697	0.0074	6.7288	1.0840
Fair LR (Dem. Parity)	1	0.6932	0.0562	-0.0010	0.0839	0.0075	8.9200
	2	0.6934	0.0203	0.0095	0.0295	0.8381	2.9201
	3	0.7323	0.0084	0.0143	0.0099	0.7667	1.1361
	4	0.7771	0.0027	0.0155	0.0048	1.5751	0.4952
	5	0.7384	0.0068	0.0106	0.0088	0.5997	0.8448

Equations (165)

$\mathcal{A}(x, A_{S}, n, \mathcal{D}) \triangleq \mathtt{Att}_{A,n,\mathcal{D}} \circ \phi(A_{S}, x)$

$V(\mathcal{A})$

$V_{z}(\mathcal{A}) \triangleq 2\Pr[\mathrm{MIA}(\mathcal{A}, A, n, \mathcal{D}) = 1 \mid Z = z] - 1.$

$\max_{\mathtt{Att}_{W} : \mathbb{W} \to \{0,1\}} V(\mathtt{Att}_{W} \circ \phi_{W}),$

$\mathtt{Att}^{*}_{W}(w)$

$\mu^{\pi}_{1}(T)$

$\mu^{\pi}_{0}(T)$

$R(\pi, d) \triangleq d(\mu^{\pi}_{1}, \mu^{\pi}_{0}),$

$d_{\mathrm{MD}}(\mu, \mu^{\prime}) \triangleq \int \omega \, \mathrm{d}\mu(\omega) - \int \omega \, \mathrm{d}\mu^{\prime}(\omega),$

$V(\mathcal{A}^{*}_{W})$

$d_{\mathrm{TV}}(\mu, \mu^{\prime}) \triangleq \sup_{T \subseteq \mathbb{W}} |\mu(T) - \mu^{\prime}(T)|$

$L^{*} \triangleq \Pr[\mathtt{Att}^{*}(W) \neq M]$

$V(\mathcal{A}_{W}) \triangleq 2\Pr[\mathtt{Att}(W) = M] - 1$

$V(\mathcal{A}^{*}_{W}) = 2(1 - \Pr[\mathtt{Att}^{*}(W) \neq M]) - 1 = 1 - 2L^{*}.$

$L^{*} = \frac{1}{2} - \frac{1}{2} d_{\mathrm{TV}}\big(\Pr_{S\sim\mathcal{D}^{n},\, x\sim S}[\phi_{W}(A_{S},x)],\ \Pr_{S\sim\mathcal{D}^{n},\, x\sim\mathcal{D}}[\phi_{W}(A_{S},x)]\big) = \frac{1}{2} - \frac{1}{2} d_{\mathrm{TV}}(\mu^{\phi_{W}}_{1}, \mu^{\phi_{W}}_{0}),$

$R(\phi_{\hat{Y}}, d_{\mathrm{TV}}) = \frac{1}{2}\int \big|f_{1}(\hat{y}) - f_{0}(\hat{y})\big| \,\mathrm{d}\hat{y},$

$V(\mathcal{A}^{*}_{\ell(A_{S},X)})$

$R(\ell, d_{\mathrm{TV}}) = \big|\Pr[\ell(A_{S},X) = 1 \mid M = 1] - \Pr[\ell(A_{S},X) = 1 \mid M = 0]\big| = \big|\mathbb{E}[\ell(A_{S},X) \mid M = 1] - \mathbb{E}[\ell(A_{S},X) \mid M = 0]\big| = |R(\ell, d_{\mathrm{MD}})|.$

$\mu^{\pi}_{1,z}(T)$

$\mu^{\pi}_{0,z}(T)$

$R_{z}(\pi, d) \triangleq d(\mu^{\pi}_{1,z}, \mu^{\pi}_{0,z}),$

$V_{z}(\mathcal{A}^{*}_{W}) \leq R_{z}(\phi_{W}, d_{\mathrm{TV}})$

$V_{z}(\mathcal{A}^{*}_{W,Z}) = R_{z}(\phi_{W}, d_{\mathrm{TV}})$

$\Delta V_{z,z^{\prime}}(\mathcal{A}^{*}_{W}) \triangleq V_{z}(\mathcal{A}^{*}_{W}) - V_{z^{\prime}}(\mathcal{A}^{*}_{W}).$

$\big|\Delta V_{z,z^{\prime}}(\mathcal{A}^{*}_{W})\big| \leq \max\{R_{z}(\phi_{W}, d_{\mathrm{TV}}), R_{z^{\prime}}(\phi_{W}, d_{\mathrm{TV}})\}$

$\Delta V_{z,z^{\prime}}(\mathcal{A}^{*}_{W,Z})$

$\hat{V}(\mathcal{A}) \triangleq \frac{1}{r}\sum_{i=1}^{r} v_{i}$


Full text


Bogdan Kulynych

EPFL

Mohammad Yaghini

University of Toronto, Vector Institute

Giovanni Cherubin

Alan Turing Institute

Michael Veale

University College London

Carmela Troncoso

EPFL

Abstract

A membership inference attack (MIA) against a machine-learning model enables an attacker to determine whether a given data record was part of the model’s training data or not. In this paper, we provide an in-depth study of the phenomenon of disparate vulnerability against MIAs: unequal success rate of MIAs against different population subgroups. We first establish necessary and sufficient conditions for MIAs to be prevented, both on average and for population subgroups, using a notion of distributional generalization. Second, we derive connections of disparate vulnerability to algorithmic fairness and to differential privacy. We show that fairness can only prevent disparate vulnerability against limited classes of adversaries. Differential privacy bounds disparate vulnerability but can significantly reduce the accuracy of the model. We show that estimating disparate vulnerability to MIAs by naïvely applying existing attacks can lead to overestimation. We then establish which attacks are suitable for estimating disparate vulnerability, and provide a statistical framework for doing so reliably. We conduct experiments on synthetic and real-world data finding statistically significant evidence of disparate vulnerability in realistic settings.

1 Introduction

Membership Inference Attacks (MIAs), in which an adversary aims to determine whether an example is part of the training set, are one of the main privacy attacks against machine-learning (ML) models. Since they were first described [39], many works have studied the potential of these attacks under diverse circumstances [33, 30, 22, 24, 32, 29], as well as their causes and limits [16, 26, 43]. Both empirical and theoretical studies focus on the average MIA success across records. However, there is empirical evidence that vulnerability to MIAs is not always evenly distributed: it can differ across target classes [39], attacks can be more effective against some individuals [29], and vulnerability can vary across subgroups [6]. These results imply that average-based studies can overestimate the privacy afforded to some individuals [15].

In this paper, we provide the first theoretical analysis of disparate vulnerability to MIAs across population subgroups. Our contributions are the following:

✓

We introduce a novel characterization of vulnerability to MIAs, which provides a necessary and sufficient condition for these attacks to succeed: a lack of distributional generalization. Vulnerability to MIAs arises when the distribution of a model’s property (e.g., its loss or outputs) differs between samples in and out of the training dataset. This result complements previous studies that showed the lack of standard generalization (i.e., overfitting) to be a sufficient but not necessary condition for vulnerability to MIAs [29, 43].

✓

We introduce the first formal analysis of disparate vulnerability and extend our results on necessary and sufficient conditions for preventing MIAs to subgroups.

✓

We show that estimating the magnitude of disparate vulnerability is non-trivial when subgroups are small. We provide a statistical framework and methods to estimate disparate vulnerability and its significance. We show that not all vulnerability estimation mechanisms used in prior work are adequate for subgroups. We discuss the implications of these difficulties for regulatory compliance.

✓

We prove that satisfying algorithmic-fairness constraints can reduce disparate vulnerability only against limited classes of attackers. We also show that training with differential privacy bounds the magnitude of disparate vulnerability.

✓

We empirically evaluate disparate vulnerability both on synthetic and on real-world datasets, demonstrating that disparate vulnerability exists in realistic models, with high statistical significance.

✓

We discuss the importance of disaggregating privacy measurements when evaluating the legal implications of privacy attacks. In particular, we argue for studying the consequences of privacy attacks for subgroups when analyzing the privacy risks of a deployment, as opposed to studying individual privacy risks [29], which can be dismissed as residual and acceptable.

2 Related work

Theory studies on MIA

Yeom et al. studied the relation of MIAs to overfitting [43]; in their work, they formalize MIA as an indistinguishability game, which we adapt to construct our theoretical framework. Farokhi et al. analyzed how MIA success depends on the amount of information the model memorizes [16], and Jayaraman et al. investigated how it depends on the prior probability that the example given to the adversary is a member or non-member of the training set [22]. Yeom et al. [43] and Cherubin et al. [9] showed that differential privacy (DP) bounds MIA success. Humphries et al. [21] showed that these bounds apply only as long as the training data are sampled i.i.d. All these analyses, however, are only meaningful for the average-case MIA. A classifier deemed secure according to these analyses may provide weaker protection to certain individuals or subpopulations.

Our framework complements these studies and generalizes the notion of MIA risk to subgroups of the population, thus enabling the study of vulnerability for subsets of records that share a label, for individuals, and for subpopulations.

Disparity and machine learning

The work on disparity in machine learning is centered on understanding and mitigating disparate impact of algorithmic decisions on subpopulations [10, 2, 28]. Bagdasaryan et al. [1] and Pujol et al. [35] study disparity in accuracy under differential privacy (DP), and show that training with DP can increase disparate impact.

In this work, we develop a theory that supports the empirical evidence that disparate impact can also cause disparity in vulnerability to MIAs [6, 29, 39].

3 Membership Inference Attacks

Let $\Omega$ be a population of examples, where each example represents an individual: $x\in\Omega$. We assume that the population is partitioned into disjoint subgroups. Each subgroup $G_{z}\subset\Omega$ is formed by examples that share one or several attributes (e.g., race or gender in the way they are commonly represented in data), such that $\bigcup_{z=1}^{t}G_{z}=\Omega$. We consider a data-generating distribution $\mathcal{D}$ over $\Omega$.

We indicate with $A(\cdot)$ the training algorithm that produces a model $A_{S}$ given training data $S\subset\Omega$ . The learning task for this model is to infer the value of the label $y=y(x)$ associated with an individual $x$ . We assume that the model can be either a regressor ( $y$ takes values in a set with total order, e.g. $\mathbb{R}$ ) or a classifier ( $y$ takes values in a finite set).

The goal of a membership inference attack (MIA) is to predict whether an example $x\in\Omega$ is a member or a non-member of the training set $S$ . We assume a threat model where a MIA adversary observes the target model’s behavior that relates to $x$ , and has information about the data distribution $\mathcal{D}$ , training-data sampling, and the training algorithm. We formalize MIAs using the indistinguishability game by Yeom et al. [43]:

In this game, the challenger samples $S$ from the population and trains a model $A_{S}$ using training algorithm $A$ (line 1). The challenger then randomly draws a secret $m$ (line 2) whose value denotes the membership of $x$ in $S$: $m=1$ if the challenge example $x$ is sampled from the training set $S$ (line 4), and $m=0$ if it is sampled from the data distribution $\mathcal{D}$ (line 6). As in Yeom et al. [43], we assume that the population is large enough that the chance of sampling a member $x\in S$ from $\mathcal{D}$ is negligible. Given the challenge example $x$, the target model $A_{S}$ and its training algorithm $A(\cdot)$, the sampling parameter $n$, and the distribution of the training data $\mathcal{D}$, the MIA adversary $\mathcal{A}(\cdot)$ makes a guess $\hat{m}$ about the example’s membership in $S$ (line 8). We use this formalization because it is the most common, although there are other ways to formalize MIAs [21].
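As a rough illustration of the game, the Monte Carlo sketch below plays it repeatedly and estimates the adversary's normalized advantage $2\Pr[\hat{m} = m] - 1$. It assumes that a round counts as won when $\hat{m} = m$. The population (a standard normal), the two training algorithms, the two adversaries, and all function names are hypothetical toys chosen so that the outcome is easy to predict: a model that memorizes its training set is fully vulnerable, and a model that ignores its data is not vulnerable at all.

```python
import random

def mia_game(train, sample_population, n, adversary, rng):
    """Play one round of the membership-inference game; return 1 iff m_hat == m."""
    # Line 1: sample a training set S ~ D^n and train the target model A_S.
    S = [sample_population(rng) for _ in range(n)]
    model = train(S)
    # Line 2: draw the secret membership coin m uniformly from {0, 1}.
    m = rng.randint(0, 1)
    # Lines 4 and 6: the challenge x comes from S if m = 1, otherwise from D.
    x = rng.choice(S) if m == 1 else sample_population(rng)
    # Line 8: the adversary guesses membership from (x, A_S, n, D).
    m_hat = adversary(x, model, n, sample_population)
    return int(m_hat == m)

def estimate_vulnerability(train, sample_population, n, adversary, trials, seed=0):
    """Monte Carlo estimate of the normalized advantage 2 * Pr[win] - 1."""
    rng = random.Random(seed)
    wins = sum(mia_game(train, sample_population, n, adversary, rng) for _ in range(trials))
    return 2 * wins / trials - 1

# --- Hypothetical toy components (not taken from the paper) ---
def gaussian_population(rng):
    # Data distribution D: a standard normal, so exact collisions have probability ~0.
    return rng.gauss(0.0, 1.0)

def memorizer(S):
    # Training algorithm that simply stores its training set.
    return set(S)

def lookup_adversary(x, model, n, sample_population):
    # Guess "member" iff x appears verbatim in the memorized model.
    return int(x in model)

def constant_model(S):
    # Training algorithm whose output does not depend on S at all.
    return 0.0

def distance_adversary(x, model, n, sample_population):
    # A threshold rule on |x - model|; useless when the model ignores S.
    return int(abs(x - model) < 0.5)

print(estimate_vulnerability(memorizer, gaussian_population, 100, lookup_adversary, 2000))  # 1.0
print(estimate_vulnerability(constant_model, gaussian_population, 10, distance_adversary, 20000))  # close to 0
```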

The MIA game defines a joint probability distribution over training datasets $S$, membership “coins” $m$, and challenge examples $x$. We denote by $M$ the random variable taking the value of the membership coin (line 2), by $X$ the challenge example, by $Y=y(X)$ the label associated with the challenge example, by $Z$ the subgroup $z$ to which the challenge example belongs, and by $\hat{Y}=A_{S}(X)$ the output of the model $A_{S}$ at $X$.

3.1 Attack strategy

As described in the MIA game, the adversary’s knowledge is limited to $(x,A_{S},n,\mathcal{D})$ , and their goal is to guess the membership of $x$ . For brevity, we use $A_{S}$ to indicate both the access to trained models $A_{S}$ and their training algorithm $A(\cdot)$ .

We define a general strategy to perform a membership attack that encompasses several instances of MIA, e.g., [39, 43, 32]. This strategy consists of two phases.

First, the adversary prepares an attack algorithm $\mathtt{Att}_{A,n,\mathcal{D}}(\cdot)$ which depends on the target training algorithm $A(\cdot)$ , and data-sampling parameters $n$ and $\mathcal{D}$ , e.g., by training a shadow-model attack classifier [39]. We drop the subscripts in $\mathtt{Att}_{A,n,\mathcal{D}}$ where the setting is clear from the context.

In the second phase, the adversary extracts features, $w\leftarrow\phi(A_{S},x)$ , describing the target model and the example, and applies the attack algorithm to the extracted features to obtain the membership guess, $\hat{m}\leftarrow\mathtt{Att}_{A,n,\mathcal{D}}(w)$ . Thus, the adversary’s guess $\hat{m}$ is obtained by applying the attack algorithm to the extracted features:

$\mathcal{A}(x, A_{S}, n, \mathcal{D}) \triangleq \mathtt{Att}_{A,n,\mathcal{D}} \circ \phi(A_{S}, x).$

This formalization is flexible: it captures both white-box and black-box adversarial models. For example, the features could be the outputs of the model and the example’s label $w=(A_{S}(x),y(x))$ [39], the model’s loss $\ell$ for the challenge example, $w=\ell(A_{S}(x),y(x))$ [43], or the model’s gradients as in some white-box attacks [32], etc.

We use random variable $W$ to indicate the extracted features $w$ across instances of the MIA game. For example, if the attacker uses the model’s output and the label as features [39], we denote them as $W=(\smash{\hat{Y}},Y)$ . With a slight abuse of notation, we use $\phi_{W}:(A_{S},x)\mapsto w$ to indicate the procedure that extracts features $w$ that are realizations of the $W$ random variable. Furthermore, we denote by $\mathcal{A}_{W}$ an adversary that uses features $W$ .
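The two-phase structure $\mathtt{Att} \circ \phi_{W}$ can be sketched as plain function composition. The sketch is illustrative only: the toy model, the choice of cross-entropy as the loss, the threshold value, and all function names are assumptions, and the label is passed explicitly rather than derived from $x$ as $y(x)$. It shows a feature extractor of the form $w = (A_{S}(x), y(x))$, a loss-based extractor, and a simple loss-threshold attack algorithm in the spirit of loss-based attacks.

```python
import math
from typing import Callable, Sequence, Tuple

# A toy "model" maps an input vector to a tuple of class probabilities.
Model = Callable[[Sequence[float]], Tuple[float, ...]]

def phi_output_label(model: Model, x: Sequence[float], y: int) -> Tuple[Tuple[float, ...], int]:
    """Black-box features w = (A_S(x), y(x))."""
    return tuple(model(x)), y

def phi_loss(model: Model, x: Sequence[float], y: int) -> float:
    """Black-box features w = loss(A_S(x), y(x)); cross-entropy is assumed here."""
    return -math.log(max(model(x)[y], 1e-12))

def loss_threshold_attack(tau: float) -> Callable[[float], int]:
    """Att: guess 'member' (1) when the loss feature is below the threshold tau."""
    return lambda w: int(w < tau)

def compose(att, phi):
    """The adversary Att o phi: extract features from (A_S, x), then apply Att."""
    return lambda model, x, y: att(phi(model, x, y))

def toy_model(x: Sequence[float]) -> Tuple[float, ...]:
    # Hypothetical two-class model: confident in class 0 when x[0] > 0.
    return (0.9, 0.1) if x[0] > 0 else (0.2, 0.8)

adversary = compose(loss_threshold_attack(tau=0.5), phi_loss)
print(adversary(toy_model, [1.0], 0))  # loss -log(0.9) ~ 0.105 < 0.5, prints 1
print(adversary(toy_model, [1.0], 1))  # loss -log(0.1) ~ 2.303 >= 0.5, prints 0
```

Swapping `phi_loss` for `phi_output_label` (with an attack algorithm that accepts the pair) changes the features $W$ without touching the composition, which mirrors how the formalization separates the two phases.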

We distinguish two kinds of adversaries depending on the features they use: regular adversaries that do not use subgroup information ($Z\notin W$), and subgroup-aware adversaries that do ($Z\in W$). We assume that the latter can obtain the subgroup $z$ directly from the example $x$, in which it is encoded as an attribute (e.g., gender or race). This is the case in our experiments on real-world data in Section 7. In other practical scenarios, this knowledge could instead be encoded in the label $y(x)$ or come from external sources. Prior work has mainly considered regular adversaries.

3.2 Vulnerability

We introduce the concept of vulnerability of an ML model to membership inference attacks (MIAs). Vulnerability measures the success of an adversary against the model. We also introduce worst-case (Bayes) vulnerability, i.e., vulnerability against an information-theoretically optimal adversary.

We measure the vulnerability to MIAs by the normalized advantage [43] of the adversary $\mathcal{A}$ over random guessing:

Definition 1.

We define vulnerability to adversary $\mathcal{A}$ as:

$V(\mathcal{A}) \triangleq 2\Pr[\mathrm{MIA}(\mathcal{A}, A, n, \mathcal{D}) = 1] - 1.$

The definition of vulnerability can be extended to subgroups:

Definition 2.

Let $z$ be a subgroup of the population. We define subgroup vulnerability to adversary $\mathcal{A}$ as:

$V_{z}(\mathcal{A}) \triangleq 2\Pr[\mathrm{MIA}(\mathcal{A}, A, n, \mathcal{D}) = 1 \mid Z = z] - 1,$

which captures the normalized advantage of a MIA adversary $\mathcal{A}$ for challenge examples coming from a given subgroup $z$ .
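Given transcripts of many rounds of the game, both quantities can be estimated by replacing probabilities with frequencies. The sketch below uses a small hypothetical transcript to show how an aggregate estimate can mask a difference between subgroups; the helper names are illustrative, and with realistic subgroup sizes such plug-in estimates need the statistical treatment the paper develops.

```python
from collections import defaultdict

def vulnerability(guesses, memberships):
    """Empirical V = 2 * Pr[guess == membership] - 1 over game transcripts."""
    correct = sum(int(g == m) for g, m in zip(guesses, memberships))
    return 2 * correct / len(guesses) - 1

def subgroup_vulnerability(guesses, memberships, subgroups):
    """Empirical V_z: the same quantity, restricted to challenges with Z = z."""
    by_group = defaultdict(lambda: ([], []))
    for g, m, z in zip(guesses, memberships, subgroups):
        by_group[z][0].append(g)
        by_group[z][1].append(m)
    return {z: vulnerability(gs, ms) for z, (gs, ms) in by_group.items()}

# Hypothetical transcript of eight game rounds: guess m_hat, secret m, subgroup z.
guesses     = [1, 0, 1, 0, 1, 1, 0, 0]
memberships = [1, 0, 1, 0, 0, 1, 1, 0]
subgroups   = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(vulnerability(guesses, memberships))                      # 0.5
print(subgroup_vulnerability(guesses, memberships, subgroups))  # {'a': 1.0, 'b': 0.0}
```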

Optimal adversaries

We base our analysis on information-theoretically optimal adversaries. Consider the worst-case vulnerability to any attack that uses features $W$ :

$\max_{\mathtt{Att}_{W} : \mathbb{W} \to \{0,1\}} V(\mathtt{Att}_{W} \circ \phi_{W}), \qquad (2)$

where ${\mathbb{W}}$ is the domain of $W$ .

The maximum in Eq. 2 is achieved by a Bayes adversary which uses the following strategy for the attack [9, 36]:

$\mathtt{Att}^{*}_{W}(w) \triangleq \arg\max_{m \in \{0,1\}} \Pr[M = m \mid W = w].$

We denote the Bayes adversary as $\mathcal{A}^{*}_{W}\triangleq\mathtt{Att}^{*}_{W}\circ\phi_{W}$ , and drop the subscripts where no ambiguity arises.
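For a finite feature domain with known class-conditional distributions, the Bayes adversary can be computed directly. Under the uniform prior on $M$ used in the game, maximizing $\Pr[M = m \mid W = w]$ reduces to comparing $\Pr[W = w \mid M = 1]$ with $\Pr[W = w \mid M = 0]$. The sketch below assumes hypothetical distributions over three bucketed feature values, breaks ties towards guessing "member", and checks by brute force that no deterministic attack on these features does better than the Bayes rule.

```python
from fractions import Fraction as F
from itertools import product

def bayes_attack(f1, f0):
    """Bayes rule under a uniform prior on M: guess 1 wherever f1(w) >= f0(w)."""
    support = set(f1) | set(f0)
    return {w: int(f1.get(w, 0) >= f0.get(w, 0)) for w in support}

def exact_vulnerability(att, f1, f0):
    """V = 2 Pr[Att(W) = M] - 1, with Pr[M = 1] = Pr[M = 0] = 1/2."""
    win = sum(F(1, 2) * (f1.get(w, 0) if g == 1 else f0.get(w, 0)) for w, g in att.items())
    return 2 * win - 1

# Hypothetical pmfs of a bucketed feature W for members (f1) and non-members (f0).
f1 = {"low": F(1, 2), "mid": F(3, 8), "high": F(1, 8)}
f0 = {"low": F(1, 4), "mid": F(3, 8), "high": F(3, 8)}

v_bayes = exact_vulnerability(bayes_attack(f1, f0), f1, f0)
support = sorted(f1)
# Enumerate every deterministic attack Att: W -> {0, 1} and keep the best one.
v_best = max(exact_vulnerability(dict(zip(support, guesses)), f1, f0)
             for guesses in product([0, 1], repeat=len(support)))
print(v_bayes, v_best)  # 1/4 1/4
```

Exact fractions are used so that the comparison with the brute-force maximum is not affected by floating-point rounding.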

Subgroup-aware Bayes adversary

We now assume that the adversary knows the subgroup $z$ to which each example $x$ belongs. We refer to this adversary as subgroup-aware. Because the vulnerability to the Bayes adversary cannot decrease when the adversary has more information about the examples, the worst-case vulnerability to a subgroup-aware (Bayes) adversary is equal to or higher than the vulnerability to a regular (Bayes) adversary:

Proposition 1.

$V(\mathcal{A}^{*}_{W,Z})\geq V(\mathcal{A}^{*}_{W})\,.$

We defer the proof to Appendix A.

In our experimental evaluations, we only consider subgroup-aware adversaries, as they are guaranteed to attain at least as high an advantage in the worst case.
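Proposition 1 can be illustrated with a toy example in which the features alone carry no membership signal, but the features combined with the subgroup do. The sketch below assumes hypothetical joint distributions over a binary feature $W$ and two equally sized subgroups (with the same subgroup proportions for members and non-members), and uses the fact that, under a uniform prior on $M$, the Bayes adversary wins with probability $\frac{1}{2}\sum_{w}\max(\Pr[W=w \mid M=1], \Pr[W=w \mid M=0])$.

```python
from fractions import Fraction as F

def bayes_vulnerability(f1, f0):
    """Worst-case V for pmfs f1 = P(W | M=1), f0 = P(W | M=0) under a uniform prior."""
    support = set(f1) | set(f0)
    # The Bayes adversary wins w.p. (1/2) * sum_w max(f1(w), f0(w)), so V = sum max - 1.
    return sum(max(f1.get(w, 0), f0.get(w, 0)) for w in support) - 1

def marginalize_z(f):
    """Drop the subgroup coordinate: P(W = w) = sum_z P(W = w, Z = z)."""
    out = {}
    for (w, z), p in f.items():
        out[w] = out.get(w, 0) + p
    return out

# Hypothetical joint pmfs over (w, z) with two equally sized subgroups "a" and "b".
f1 = {(1, "a"): F(3, 8), (0, "a"): F(1, 8), (1, "b"): F(1, 8), (0, "b"): F(3, 8)}
f0 = {(1, "a"): F(1, 8), (0, "a"): F(3, 8), (1, "b"): F(3, 8), (0, "b"): F(1, 8)}

v_regular = bayes_vulnerability(marginalize_z(f1), marginalize_z(f0))  # features W only
v_aware = bayes_vulnerability(f1, f0)                                    # features (W, Z)
print(v_regular, v_aware)  # 0 1/2
```

In this construction the membership signal points in opposite directions in the two subgroups, so it cancels out in the marginal distribution of $W$.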

4 Distributional Generalization and Vulnerability to MIAs

An ML model is said to overfit, or poorly generalize, when its average loss on the training set differs from its loss on new samples from the population. Previous work showed that, while overfitting is an important factor for evaluating MIA [39], it is not necessary for MIA vulnerability [43, 29].

Fig. 1 illustrates with an example why the absence of standard overfitting does not, in general, prevent MIAs. The figure shows a model’s loss values on its training and test data. The standard, average-based definition of overfitting cannot distinguish between the two distributions, but an adversary potentially can, so the model can still be vulnerable to MIAs.

4.1 Distributional Generalization

To establish the necessary and sufficient conditions for models to be vulnerable to MIAs, we introduce an extended notion of generalization that goes beyond comparing the average loss on train and test data. It covers the difference between the distributions of any given property of a model on the training data and outside of it. A property is any function that takes as input a model and an example, $\pi(A_{S},x)$, and returns a numeric vector. A property function can be, for instance, a loss function, the gradient, or the model’s prediction.

We are interested in the distributions of properties on the examples $x$ coming from the training dataset and from outside of the training dataset. For any set $T$ from the range of $\pi$ , we define the corresponding probability measures as:

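$\mu^{\pi}_{1}(T) \triangleq \Pr_{S\sim\mathcal{D}^{n},\, x\sim S}[\pi(A_{S},x)\in T], \qquad \mu^{\pi}_{0}(T) \triangleq \Pr_{S\sim\mathcal{D}^{n},\, x\sim\mathcal{D}}[\pi(A_{S},x)\in T].$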

Definition 3.

For any property function $\pi(A_{S},x)$ , we define the distributional-generalization gap as follows:

$R(\pi, d) \triangleq d(\mu^{\pi}_{1}, \mu^{\pi}_{0}),$

where $d(\mu,\mu^{\prime})$ is a measure of dissimilarity between probability distributions.

This generic notion subsumes the standard notion of generalization. Standard generalization can be measured using the average-dataset generalization gap (see, e.g., Yeom et al. [43]), the difference between the expected loss on the training dataset and the expected loss on the distribution:

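$\mathbb{E}_{S\sim\mathcal{D}^{n},\, x\sim S}[\ell(A_{S},x)] - \mathbb{E}_{S\sim\mathcal{D}^{n},\, x\sim\mathcal{D}}[\ell(A_{S},x)],$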

where $\ell(A_{S},x)$ is a loss function. We can recover this standard notion of a generalization gap as $R({\ell},{d_{\mathrm{MD}}})$ , using the loss function as the property function and the mean discrepancy function $d_{\mathrm{MD}}(\mu,\mu^{\prime})$ as a measure of dissimilarity:

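$d_{\mathrm{MD}}(\mu, \mu^{\prime}) \triangleq \int \omega \, \mathrm{d}\mu(\omega) - \int \omega \, \mathrm{d}\mu^{\prime}(\omega).$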

Whereas standard generalization quantifies how much the training algorithm tends to memorize the training dataset through the lens of its performance (loss), distributional generalization can do so (1) through the lens of other properties beyond losses, and (2) considering distributional information instead of only the difference between the means.

Evaluating distributional generalization enables us to assess the generalization of an ML model on the entire population, rather than on average. In Fig. 1 it is clear that the model’s actual loss across the entire population is concentrated on a few individuals. Distributional generalization enables us to capture this discrepancy, whereas standard generalization does not.
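The difference between the two notions can be checked on a small hypothetical example in the spirit of Fig. 1 (the numbers are not taken from the figure). Below, per-example losses on members and non-members have the same mean, so the mean discrepancy vanishes, yet the two empirical distributions have disjoint supports, so their total-variation distance (formally defined in Section 4.2) is maximal. The empirical $d_{\mathrm{TV}}$ here treats losses as discrete values; continuous losses would need binning or density estimation, which this sketch does not attempt.

```python
from collections import Counter

def d_md(sample1, sample0):
    """Mean discrepancy between two empirical distributions of a property."""
    return sum(sample1) / len(sample1) - sum(sample0) / len(sample0)

def d_tv(sample1, sample0):
    """Total variation between two empirical distributions over discrete values."""
    p, q = Counter(sample1), Counter(sample0)
    n1, n0 = len(sample1), len(sample0)
    return sum(abs(p[v] / n1 - q[v] / n0) for v in set(p) | set(q)) / 2

# Hypothetical per-example losses of a model on members ("in") and non-members ("out").
loss_in = [0.5] * 6
loss_out = [0.0, 1.0] * 3
print(d_md(loss_in, loss_out))  # 0.0: no standard generalization gap
print(d_tv(loss_in, loss_out))  # 1.0: the two distributions are fully distinguishable
```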

Concurrently, Nakkiran and Bansal [31] have also proposed a similar notion of distributional generalization. Our proposal allows for more general distances between distributions, whereas Nakkiran and Bansal, when translated to our terms, define the gap using the mean discrepancy, which is not sufficient for our analysis.

4.2 Relation between Worst-case Vulnerability and Distributional Generalization

The ability of any classifier to successfully distinguish between observations of two classes can be characterized by the total variation between the class-conditional distributions of observations. By applying this well-known fact to the worst-case MIA attackers, we can characterize vulnerability in terms of distributional generalization:

Proposition 2.

The worst-case vulnerability to MIAs with adversary’s features $W$ is equal to the distributional-generalization gap under total-variation distance:

$V(\mathcal{A}^{*}_{W}) = R(\phi_{W}, d_{\mathrm{TV}}),$

where the total-variation distance is defined as:

$d_{\mathrm{TV}}(\mu, \mu^{\prime}) \triangleq \sup_{T \subseteq \mathbb{W}} |\mu(T) - \mu^{\prime}(T)|.$

According to Proposition 2, when the property function $\pi$ is the adversary’s feature extraction mechanism $\phi_{W}$ , the distributional-generalization gap is equal to the worst-case vulnerability to adversaries that use features $W=\phi_{W}(A_{S},X)$ .

Proof.

Let us define the Bayes error $L^{*}$ , the 0-1 classification error of the Bayes classifier. In the case of $\mathtt{Att}^{*}$ :

$L^{*} \triangleq \Pr[\mathtt{Att}^{*}(W) \neq M].$

Recall that vulnerability is defined through the success probability of an adversary:

$V(\mathcal{A}_{W}) \triangleq 2\Pr[\mathtt{Att}(W) = M] - 1.$

Thus, for a Bayes adversary, $V(\mathcal{A}^{*}_{W})$ uses the complement of the Bayes error $L^{*}$ :

$V(\mathcal{A}^{*}_{W}) = 2(1 - \Pr[\mathtt{Att}^{*}(W) \neq M]) - 1 = 1 - 2L^{*}.$

It is well known that the Bayes error of a binary classifier under a uniform prior is equal to:

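$L^{*} = \frac{1}{2} - \frac{1}{2} d_{\mathrm{TV}}\big(\Pr_{S\sim\mathcal{D}^{n},\, x\sim S}[\phi_{W}(A_{S},x)],\ \Pr_{S\sim\mathcal{D}^{n},\, x\sim\mathcal{D}}[\phi_{W}(A_{S},x)]\big) = \frac{1}{2} - \frac{1}{2} d_{\mathrm{TV}}(\mu^{\phi_{W}}_{1}, \mu^{\phi_{W}}_{0}),$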

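See, e.g., Devroye et al. [12, Chapter 3.9]. Substituting this expression into $V(\mathcal{A}^{*}_{W}) = 1 - 2L^{*}$ yields $V(\mathcal{A}^{*}_{W}) = d_{\mathrm{TV}}(\mu^{\phi_{W}}_{1}, \mu^{\phi_{W}}_{0}) = R(\phi_{W}, d_{\mathrm{TV}})$, which is the sought form.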

∎
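For a finite feature domain, each quantity in the proof can be computed exactly. The sketch below uses hypothetical distributions $\mu_1$ and $\mu_0$ over four feature values and evaluates $d_{\mathrm{TV}}$ in three ways: as the supremum over all subsets $T$ (by brute force), as half the $L_1$ distance between the probability mass functions, and as $1 - 2L^{*}$, where the Bayes error under a uniform prior is $L^{*} = \frac{1}{2}\sum_{w}\min(\mu_1(w), \mu_0(w))$. All three agree, as Proposition 2 predicts.

```python
from fractions import Fraction as F
from itertools import chain, combinations

def tv_sup(mu1, mu0):
    """d_TV as in the definition: sup over subsets T of |mu1(T) - mu0(T)| (brute force)."""
    support = sorted(set(mu1) | set(mu0))
    subsets = chain.from_iterable(combinations(support, k) for k in range(len(support) + 1))
    return max(abs(sum(mu1.get(w, 0) for w in T) - sum(mu0.get(w, 0) for w in T))
               for T in subsets)

def bayes_error(mu1, mu0):
    """L* = Pr[Att*(W) != M] under a uniform prior: (1/2) * sum_w min(mu1(w), mu0(w))."""
    support = set(mu1) | set(mu0)
    return F(1, 2) * sum(min(mu1.get(w, 0), mu0.get(w, 0)) for w in support)

# Hypothetical feature distributions for members (mu1) and non-members (mu0).
mu1 = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 8), "d": F(1, 8)}
mu0 = {"a": F(1, 8), "b": F(1, 4), "c": F(3, 8), "d": F(1, 4)}

half_l1 = F(1, 2) * sum(abs(mu1[w] - mu0[w]) for w in mu1)
print(tv_sup(mu1, mu0), half_l1, 1 - 2 * bayes_error(mu1, mu0))  # 3/8 3/8 3/8
```

The brute-force supremum enumerates $2^{|\mathbb{W}|}$ subsets, so it is only a check for tiny domains; the half-$L_1$ form is what one would compute in practice.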

This form is a straightforward consequence of our Bayes-optimal approach to vulnerability and is an application of a well-known result in statistical theory. It provides an intuitive interpretation of the worst-case vulnerability to MIAs as the distributional-generalization gap, and thus a guideline on how to prevent MIAs. The result holds for both white-box and black-box adversary models.

Let us visually illustrate distributional generalization and worst-case vulnerability. Consider adversarial features $W=\smash{\hat{Y}}$ . For the continuous property function $\phi_{\smash{\hat{Y}}}$ , the distributional-generalization gap becomes:

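$R(\phi_{\hat{Y}}, d_{\mathrm{TV}}) = \frac{1}{2}\int \big|f_{1}(\hat{y}) - f_{0}(\hat{y})\big| \,\mathrm{d}\hat{y},$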

where $f_{1}$ and $f_{0}$ are probability density functions associated with measures $\mu_{1}$ and $\mu_{0}$ , respectively. See Fig. 2 for a visualization. The worst-case vulnerability to adversaries using features $W=\smash{\hat{Y}}$ is the area between the densities of the “in” and “out” output distributions.
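The integral form can be evaluated numerically. As a sanity check, the sketch below assumes hypothetical Gaussian output densities with equal variance $\sigma^2$ and means $m_1$, $m_0$, for which the total-variation distance has the closed form $2\Phi\big(|m_1 - m_0|/(2\sigma)\big) - 1 = \operatorname{erf}\big(|m_1 - m_0|/(2\sqrt{2}\,\sigma)\big)$, and compares it with a midpoint-rule approximation of $\frac{1}{2}\int|f_1 - f_0|$ on a truncated interval. Real model outputs are rarely Gaussian; the choice only makes the expected value easy to verify.

```python
import math

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def tv_from_densities(f1, f0, lo, hi, steps=100_000):
    """(1/2) * integral over [lo, hi] of |f1 - f0|, by the midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        total += abs(f1(y) - f0(y))
    return 0.5 * total * h

# Hypothetical output-score densities: members N(0.6, 0.2^2), non-members N(0.4, 0.2^2).
def f1(y):
    return normal_pdf(y, 0.6, 0.2)

def f0(y):
    return normal_pdf(y, 0.4, 0.2)

numeric = tv_from_densities(f1, f0, -2.0, 3.0)
# Closed form for two equal-variance Gaussians: erf(|m1 - m0| / (2 * sqrt(2) * sigma)).
closed_form = math.erf(abs(0.6 - 0.4) / (2 * math.sqrt(2) * 0.2))
print(f"{numeric:.4f} {closed_form:.4f}")  # both approximately 0.3829
```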

Note that the distance used in Proposition 2 is average-dataset. That is, when computing the features $\phi(A_{S},X)$, the model $A_{S}$ is a random variable over the randomness of $A(\cdot)$ and $S\sim\mathcal{D}^{n}$. To train models with minimal vulnerability to MIAs, Li et al. [27] used a similar yet different notion: the distance between the outputs of a fixed model on its training dataset and on a validation dataset. Although conceptually similar, such a distance cannot be directly used to evaluate the worst-case vulnerability using Proposition 2.

Standard overfitting and worst-case vulnerability

The absence of overfitting in the standard sense does not necessarily preclude MIAs [43, 29]. However, a straightforward implication of Proposition 2 shows that there is a case in which the standard generalization gap does bound the worst-case vulnerability:

Corollary 1.

Let $\ell(A_{S},x)=\mathds{1}[A_{S}(x)\neq y(x)]$ be the 0-1 loss, and the adversary’s features be the loss values $W=\ell(A_{S},X)$ . Then, the standard generalization gap equals worst-case vulnerability:

$V(\mathcal{A}^{*}_{\ell(A_{S},X)}) = R(\ell, d_{\mathrm{TV}}) = |R(\ell, d_{\mathrm{MD}})|.$

Proof.

As loss is binary-valued, $R({\ell},{d_{\mathrm{TV}}})$ simplifies to:

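$R(\ell, d_{\mathrm{TV}}) = \big|\Pr[\ell(A_{S},X) = 1 \mid M = 1] - \Pr[\ell(A_{S},X) = 1 \mid M = 0]\big| = \big|\mathbb{E}[\ell(A_{S},X) \mid M = 1] - \mathbb{E}[\ell(A_{S},X) \mid M = 0]\big| = |R(\ell, d_{\mathrm{MD}})|.$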

∎

Therefore, if a MIA adversary only observes whether a queried example is predicted correctly or incorrectly by the target model, the maximum advantage of any such attack equals the magnitude of standard overfitting, $|R({\ell},{d_{\mathrm{MD}}})|$. Thus, for such an adversarial model, no overfitting does imply no vulnerability to MIAs.
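Corollary 1 can be checked on a toy example. The sketch below assumes hypothetical predictions on ten members and ten non-members and treats the resulting empirical error rates as the true conditional probabilities $\Pr[\ell = 1 \mid M = 1]$ and $\Pr[\ell = 1 \mid M = 0]$. Under that assumption, the advantage of the Bayes adversary that only sees the 0-1 loss equals the absolute value of the standard generalization gap.

```python
from fractions import Fraction as F

def error_rate(predictions, labels):
    """Empirical mean of the 0-1 loss, kept exact as a fraction."""
    return F(sum(int(p != y) for p, y in zip(predictions, labels)), len(labels))

def bayes_vulnerability_binary(p1, p0):
    """Worst-case V for a binary feature with P(W=1 | M=1) = p1 and P(W=1 | M=0) = p0."""
    # Under a uniform prior, the Bayes adversary wins w.p. (max(p1, p0) + max(1 - p1, 1 - p0)) / 2.
    return max(p1, p0) + max(1 - p1, 1 - p0) - 1

# Hypothetical predictions and labels on members (train) and non-members (test).
train_pred = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
train_y    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
test_pred  = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]
test_y     = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

p1 = error_rate(train_pred, train_y)  # Pr[loss = 1 | M = 1]
p0 = error_rate(test_pred, test_y)    # Pr[loss = 1 | M = 0]
gen_gap = p1 - p0                     # R(l, d_MD) for the 0-1 loss
print(gen_gap, bayes_vulnerability_binary(p1, p0))  # -1/5 1/5
```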

4.3 Disparate Vulnerability

In this section, we provide a theoretical analysis of vulnerability to MIAs disaggregated by subgroups.

We introduce a subgroup-specific version of distributional generalization, in which the distributions of the property $\pi$ are computed on examples that belong to a given subgroup. For any set $T$ from the range of $\pi$ , we define subgroup-specific measures:

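$\mu^{\pi}_{1,z}(T) \triangleq \Pr_{S\sim\mathcal{D}^{n},\, x\sim S(\cdot\mid z)}[\pi(A_{S},x)\in T], \qquad \mu^{\pi}_{0,z}(T) \triangleq \Pr_{S\sim\mathcal{D}^{n},\, x\sim\mathcal{D}(\cdot\mid z)}[\pi(A_{S},x)\in T],$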

where $x\sim(\cdot\mid z)$ denotes sampling conditioned on the subgroup $z$ .

Definition 4.

For any property function $\pi(A_{S},x)$ , we define the subgroup-specific distributional-generalization gap:

$R_{z}(\pi, d) \triangleq d(\mu^{\pi}_{1,z}, \mu^{\pi}_{0,z}),$

where $d(\mu,\mu^{\prime})$ is a measure of dissimilarity between probability distributions.

Subgroup vulnerability from distributional generalization

In order to extend the worst-case analysis to subgroups, we consider the worst-case subgroup vulnerability, i.e., the subgroup vulnerability to the Bayes adversary that uses features $W$: $V_{z}(\mathcal{A}^{*}_{W})$. We show that this worst-case subgroup vulnerability is also related to distributional generalization:

Proposition 3.

The worst-case vulnerability of a subgroup $z$ is bounded by the subgroup-specific distributional-generalization gap:

$V_{z}(\mathcal{A}^{*}_{W}) \leq R_{z}(\phi_{W}, d_{\mathrm{TV}}).$

Moreover, for subgroup-aware adversaries the bound becomes an equality:

$V_{z}(\mathcal{A}^{*}_{W,Z}) = R_{z}(\phi_{W}, d_{\mathrm{TV}}).$

We defer the proof to Appendix A.

Formalizing disparate vulnerability

Finally, having discussed subgroup vulnerability, we can analyze disparate vulnerability. Formally, let us define disparity in vulnerability.

Definition 5.

Disparity in vulnerability (or disparity for short) between two subgroups $z$ and $z^{\prime}$ is the difference in vulnerability of these subgroups:

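$\Delta V_{z,z^{\prime}}(\mathcal{A}^{*}_{W}) \triangleq V_{z}(\mathcal{A}^{*}_{W}) - V_{z^{\prime}}(\mathcal{A}^{*}_{W}).$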

The previous results on the connection between subgroup vulnerability and distributional generalization enable us to relate disparity to degrees of distributional generalization across different population subgroups. From Proposition 3, we can see that the magnitude of disparity can be trivially bounded using distributional-generalization gaps of the involved subgroups:

Corollary 2.

The magnitude of disparity between subgroups $z$ and $z^{\prime}$ is upper-bounded:

[TABLE]

Moreover, disparity has an exact closed form for subgroup-aware adversaries:

Corollary 3.

Suppose that a subgroup-aware adversary uses features $(W,Z)$ . Then, disparity between subgroups $z$ and $z^{\prime}$ is the difference between distributional generalization gaps of these subgroups:

[TABLE]
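As a concrete illustration of Corollary 3, here is a minimal plug-in sketch. It assumes (our assumptions, not the paper's statement) that the dissimilarity $d$ is the total-variation distance, that the features $W$ take finitely many values, and that a challenge example is equally likely to be “in” or “out”; under these assumptions the Bayes adversary's advantage on a subgroup equals the total-variation distance between that subgroup's “in” and “out” feature distributions. The function names are hypothetical.

```python
from collections import Counter


def total_variation(sample_p, sample_q):
    """Total-variation distance between the empirical distributions of two
    samples of discrete values: 0.5 * sum_w |p(w) - q(w)|."""
    p, q = Counter(sample_p), Counter(sample_q)
    support = set(p) | set(q)
    return 0.5 * sum(
        abs(p[w] / len(sample_p) - q[w] / len(sample_q)) for w in support
    )


def subgroup_gap(w_in, z_in, w_out, z_out, z):
    """Distance between the feature distributions of "in" and "out"
    examples, restricted to subgroup z."""
    sub_in = [w for w, g in zip(w_in, z_in) if g == z]
    sub_out = [w for w, g in zip(w_out, z_out) if g == z]
    return total_variation(sub_in, sub_out)


def disparity(w_in, z_in, w_out, z_out, z, z_prime):
    """Difference of the two subgroup gaps (Corollary 3 under the stated assumptions)."""
    return subgroup_gap(w_in, z_in, w_out, z_out, z) - subgroup_gap(
        w_in, z_in, w_out, z_out, z_prime
    )


# Toy data: binarised features W of "in" and "out" examples, with subgroup labels.
w_in = [0, 0, 1, 1, 1, 1, 1, 0]
z_in = ["a", "a", "a", "a", "b", "b", "b", "b"]
w_out = [0, 0, 1, 1, 0, 0, 0, 1]
z_out = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(disparity(w_in, z_in, w_out, z_out, "b", "a"))  # 0.5
```

Such plug-in distances are themselves biased upwards on small samples: two finite samples drawn from the same distribution typically have a positive empirical distance, which is one reason why subgroup estimates need the statistical care discussed in Section 5.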

4.4 Takeaways

4.4.0.1 Necessary and sufficient condition for MIA existence

Without making any parametric assumptions, we have shown that the vulnerability to MIAs can be characterized using an extended notion of generalization, and that disparity is bounded by the difference in levels of distributional generalization across population subgroups. This interpretation of a standard result in statistical theory generalizes and complements the theoretical findings of Yeom et al. [43] and Sablayrolles et al. [36]. It also confirms that the presence of standard overfitting is not a necessary condition for MIAs to succeed [43, 29].

4.4.0.2 Hardness of defending against MIAs

The interpretation of worst-case vulnerability through distributional generalization has important consequences for practical defences against MIAs that do not rely on differential privacy.

In order to reduce the vulnerability against adversaries that use features $W$, the distribution of $W$ for examples that are outside of the training set has to be close to that for the training-set examples. This means that, to avoid vulnerability, a target model has to (either implicitly or explicitly) learn the distribution of $W$ [23], which is a stronger requirement than what is typically necessary for its main task (i.e., generalization in terms of accuracy, or average error).

Moreover, adversaries are not limited to one set of features $W$ ; thus, the distribution has to be learned for a multitude of possible configurations of adversarial features $W$ . Additionally, to prevent disparity in vulnerability, the distribution of $W$ has to be learned across population subgroups, which is an even more challenging task.

5 Detecting and Measuring Disparate Vulnerability

We showed in Section 4 that vulnerability to MIAs appears when a model lacks distributional generalization. The degree to which records are vulnerable can vary across subgroups in the data, potentially resulting in disparate vulnerability. In this section, we provide mechanisms to reliably estimate subgroup vulnerability and its disparity in practice.

To empirically estimate MIA vulnerability, we simulate the MIA game with a real attack. If we could play the game infinitely many times, then estimating the success probability of the adversary would be trivial. In practice, however, we can only run the game a finite number of times, which provides us with a finite number of challenge examples $x$. We group these examples into two sets of datasets of $n$ elements: a set of $r$ datasets $\{S_{i}\}_{i=1..r}$ composed of $n$ “in” examples (i.e., sampled as in line 4 of the MIA game, used for training), and $r$ datasets $\{\bar{S}_{i}\}_{i=1..r}$ composed of $n$ “out” examples (i.e., sampled as in line 6 of the MIA game, not used for training). Each pair of datasets $S_{i}$ and $\bar{S}_{i}$ can be seen as the train and test datasets of one model.

We define the estimate of vulnerability as:

$$\hat{V}(\mathcal{A})\triangleq\frac{1}{r}\sum_{i=1}^{r}v_{i},$$

where $v_{i}$ is the model-specific estimate of vulnerability: the advantage of the adversary against a single target model. We compute $v_{i}$ for a pair of datasets $S_{i}$ and $\bar{S}_{i}$ as:

[TABLE]

As $r$ increases, $\smash{\hat{V}}(\mathcal{A})$ approximates the value of the true vulnerability $V$ .

We use the same approach to estimate subgroup vulnerability $V_{z}(\mathcal{A})$ , but we only use examples that belong to the subgroup of interest $z$ when computing the model-specific estimate of subgroup vulnerability $v_{i,z}$ . We omit $\mathcal{A}$ when it is clear from context.
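The estimator above can be sketched as follows. We assume, for illustration, that the attack outputs a 0/1 guess per challenge example (1 meaning “in”), and we read the model-specific advantage $v_i$ as the attack's true-positive rate on $S_i$ minus its false-positive rate on $\bar{S}_i$; the function names are hypothetical. The subgroup estimate $v_{i,z}$ simply restricts both sets to examples of subgroup $z$.

```python
def model_advantage(guesses_in, guesses_out):
    """Model-specific estimate v_i: the fraction of "in" examples the attack
    flags as members minus the fraction of "out" examples it flags (1 = "in")."""
    return sum(guesses_in) / len(guesses_in) - sum(guesses_out) / len(guesses_out)


def vulnerability_estimate(per_model):
    """Estimate V-hat: the average of v_i over the r target models.
    `per_model` is a list of (guesses_in, guesses_out) pairs, one per model."""
    advantages = [model_advantage(g_in, g_out) for g_in, g_out in per_model]
    return sum(advantages) / len(advantages)


def restrict(guesses, groups, z):
    """Keep only the guesses on examples from subgroup z (used for v_{i,z})."""
    return [g for g, group in zip(guesses, groups) if group == z]


# Two target models, four "in" and four "out" challenge examples each.
per_model = [
    ([1, 1, 0, 1], [0, 1, 0, 0]),  # v_1 = 0.75 - 0.25 = 0.5
    ([1, 0, 1, 0], [0, 0, 1, 0]),  # v_2 = 0.50 - 0.25 = 0.25
]
print(vulnerability_estimate(per_model))  # 0.375

# Subgroup-specific estimate for model 1 and subgroup "b".
g_in, g_out = per_model[0]
v_1b = model_advantage(restrict(g_in, ["a", "b", "a", "b"], "b"),
                       restrict(g_out, ["a", "a", "b", "b"], "b"))
print(v_1b)  # 1.0
```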

5.1 Statistical Detection of Disparity

When evaluating subgroup vulnerability, we have to rely on subsets of $(S_{i},\bar{S}_{i})$ formed by subgroup examples. These subsets can be much smaller than $n$. Due to the variance of the empirical averages in Eq. 10, an estimate of subgroup vulnerability is in general less statistically reliable than the estimate of overall vulnerability that uses datasets $(S_{i},\bar{S}_{i})$ in their entirety. As a result, when estimating disparate vulnerability using the estimates of subgroup vulnerability, we need to statistically ensure that, if found, disparity is not due to random chance.

More formally, given estimates $\{v_{i,z}\}_{i=1..r}$ across different subgroups, we want to find statistical evidence that the actual subgroup vulnerabilities differ:

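$$H_{0}:\ V_{z}(\mathcal{A})=V_{z^{\prime}}(\mathcal{A})\ \text{for all subgroups } z,z^{\prime};\qquad H_{1}:\ V_{z}(\mathcal{A})\neq V_{z^{\prime}}(\mathcal{A})\ \text{for some } z\neq z^{\prime}.$$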

5.1.0.1 Multiple subgroups

This problem is an instance of a standard within-subjects experimental design: we have multiple measurements (model-specific vulnerability estimates for different subgroups $v_{i,z_{1}},v_{i,z_{2}},\ldots,v_{i,z_{t}}$) for the same subject (model $A_{S_{i}}$). We want to know whether the means of vulnerability values differ across subgroups. Therefore, we can determine whether the training algorithm exhibits disparate vulnerability using the repeated-measures one-way ANOVA model (see, e.g., Seltman [38, Chapter 14]). This approach enables us to use the ANOVA F-test to establish whether there is evidence of disparate vulnerability. Following the standard protocol, if the F-test is positive, we perform post-hoc follow-up tests to determine which particular pairs of subgroups exhibit disparity. For the post-hoc tests, we use pairwise dependent t-tests with correction for multiple comparisons. As the correction method, we use the standard Benjamini-Hochberg procedure for controlling the false discovery rate.
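The post-hoc correction step can be illustrated with a plain-Python version of the textbook Benjamini-Hochberg step-up procedure (not the authors' code). In practice, library routines such as `AnovaRM` in `statsmodels.stats.anova` (repeated-measures F-test) and `multipletests(..., method="fdr_bh")` in `statsmodels.stats.multitest` cover both steps; the p-values below are made-up numbers.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns one boolean per p-value: True if that null hypothesis is rejected
    while controlling the false discovery rate at level `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    # Largest rank k (1-based) whose sorted p-value satisfies p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= k_max
    return rejected


# p-values of pairwise dependent t-tests for three subgroups (illustrative numbers).
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
p_vals = [0.004, 0.2, 0.03]
print([pair for pair, rej in zip(pairs, benjamini_hochberg(p_vals)) if rej])
# [('A', 'B'), ('B', 'C')]
```

Note that the pair ("B", "C") is rejected here although its p-value exceeds the Bonferroni cut-off $0.05/3$; controlling the false discovery rate is less conservative than controlling the family-wise error rate.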

5.1.0.2 Two subgroups

When comparing only two subgroups, $z$ and $z^{\prime}$ , the procedure naturally simplifies to running one dependent t-test that checks if the difference between means of two groups is significant.
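For two subgroups, the test statistic can be computed directly from the paired per-model estimates $v_{i,z}$ and $v_{i,z^{\prime}}$; the sketch below uses only the standard library and illustrative numbers. A p-value additionally requires the Student t distribution, e.g. via `scipy.stats.ttest_rel`, which performs the same dependent t-test.

```python
import math
import statistics


def paired_t_statistic(v_z, v_z_prime):
    """Dependent (paired) t-test statistic for H0: mean of v_{i,z} - v_{i,z'} is 0.

    Each index i corresponds to one target model, so the two lists are paired.
    Returns the t statistic and its degrees of freedom (r - 1).
    """
    diffs = [a - b for a, b in zip(v_z, v_z_prime)]
    r = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(r)
    return statistics.mean(diffs) / standard_error, r - 1


# Per-model subgroup vulnerability estimates for r = 5 models (illustrative).
v_z = [0.11, 0.13, 0.12, 0.14, 0.10]
v_z_prime = [0.10, 0.10, 0.10, 0.10, 0.10]
t, df = paired_t_statistic(v_z, v_z_prime)
print(round(t, 3), df)  # 2.828 4
```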

5.2 The Bias Problem

Some attacks in the literature assume that the adversary has additional knowledge beyond the tuple $(x,A_{S},n,\mathcal{D})$ . This knowledge can result in the vulnerability estimation being positively biased: indicating higher vulnerability than the actual worst case within the knowledge model of $(x,A_{S},n,\mathcal{D})$ . Overestimating vulnerability is not necessarily an issue, as pessimistic estimates incentivize caution in deployment. However, if the positive bias is correlated with the parameters of a subgroup (e.g., higher bias for smaller subgroups), it leads to incorrect conclusions about disparate vulnerability.

In this section, we check whether estimates of vulnerability using attacks proposed in the literature are biased. We evaluate three attacks:

–

Shadow-model attack [39]. An adversary trains a number of shadow models using the target training algorithm $A(\cdot)$ on datasets sampled from $\mathcal{D}^{n}$. The adversary uses these shadow models to train a machine-learning classifier to estimate the probability $\Pr[M\mid W]$. In our evaluation, we use 30 shadow models and Gradient Boosting Trees as the attack classifier.

–

Average-threshold attack [43]. An adversary has additional knowledge: the average loss on the training dataset $\tau$ and the loss function $\ell$ used to compute this average, $(\tau,\ \ell(\cdot,\cdot))$, where $\tau\triangleq\frac{1}{n}\sum_{x\in S}\ell(A_{S},x)$. When attacking, the adversary uses $\tau$ as a threshold to decide whether the challenge example was “in” (the example’s loss is less than the threshold) or “out” (the loss is greater than the threshold).

–

Optimal-threshold attack [40, 6]. An adversary has additional knowledge: the loss function $\ell$ and the optimal loss threshold $\tau^{*}$ that best separates the losses, $(\tau^{*},\ \ell(\cdot,\cdot))$, where

[TABLE]

The attack proceeds as the average-threshold attack.
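Why the optimal-threshold attack uses knowledge beyond $(x,A_{S},n,\mathcal{D})$ is easiest to see in code. The sketch below searches the observed loss values for the threshold that maximizes true-positive rate minus false-positive rate of the rule “in” if the loss is at most $\tau$; the search strategy and the function name are our assumptions, not the original authors' implementation. It needs the loss of every “in” and “out” example, i.e., knowledge of the target's training set.

```python
def optimal_threshold(losses_in, losses_out):
    """Return (tau*, advantage): the observed loss value tau maximising
    TPR - FPR of the rule "in if loss <= tau".

    Requires the losses of both the target's training ("in") examples and
    non-training ("out") examples.
    """
    best_tau, best_adv = float("-inf"), 0.0  # tau = -inf flags nothing: advantage 0
    for tau in sorted(set(losses_in) | set(losses_out)):
        tpr = sum(loss <= tau for loss in losses_in) / len(losses_in)
        fpr = sum(loss <= tau for loss in losses_out) / len(losses_out)
        if tpr - fpr > best_adv:
            best_tau, best_adv = tau, tpr - fpr
    return best_tau, best_adv


losses_in = [0.1, 0.2, 0.3, 0.8]    # cross-entropy losses on training examples
losses_out = [0.4, 0.5, 0.25, 0.9]  # losses on held-out examples
print(optimal_threshold(losses_in, losses_out))  # (0.2, 0.5)
```

Because $\tau^{*}$ is chosen on the very losses on which its advantage is then measured, the reported advantage in this sketch is never negative, even for a model that leaks nothing.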

We deviate slightly from the attacks’ original formulations. The threshold attacks use $W=\ell(A_{S},X)$ as features, where the loss function is cross-entropy, whereas the original shadow-model attack used $W=(\hat{Y},Y)$. For a fair comparison, we make all adversaries use the threshold attacks’ features.

As we want to evaluate subgroup-aware adversaries, we use features $W=(\ell(A_{S},X),Z)$ for all attacks, with cross-entropy as the loss function. We make the attacks subgroup-aware as follows. For the shadow-model attack, the adversary trains a separate attack classifier for each subgroup, and then applies the appropriate classifier to each challenge example. For the threshold attacks, we assume the adversary has a different threshold for each subgroup [6, 41], i.e., the per-subgroup average loss for the average-threshold attack and the per-subgroup optimal threshold for the optimal-threshold attack.
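A minimal sketch of the subgroup-aware average-threshold attack, assuming per-example cross-entropy losses are already computed; the names are hypothetical. The thresholds are fitted on the target's training losses per subgroup, which is exactly the extra knowledge $(\tau,\ell)$ granted to this adversary.

```python
from collections import defaultdict


def fit_average_thresholds(train_losses, train_groups):
    """Per-subgroup thresholds: the average training loss of each subgroup z."""
    totals, counts = defaultdict(float), defaultdict(int)
    for loss, z in zip(train_losses, train_groups):
        totals[z] += loss
        counts[z] += 1
    return {z: totals[z] / counts[z] for z in totals}


def guess_membership(losses, groups, thresholds):
    """Guess 1 ("in") when a challenge example's loss is below its subgroup's threshold."""
    return [int(loss < thresholds[z]) for loss, z in zip(losses, groups)]


thresholds = fit_average_thresholds([0.2, 0.4, 1.0, 2.0], ["a", "a", "b", "b"])
print(guess_membership([0.1, 0.5, 1.0, 3.0], ["a", "a", "b", "b"], thresholds))
# [1, 0, 1, 0]
```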

5.2.0.1 Method

It is hard to tell exactly if an estimate is higher than the worst-case vulnerability, as in practice the worst case is unknowable. We propose a simple test for bias within our adversarial model: run the estimation method against data-independent models. A target model can be independent of its training data, e.g., if it is completely random, constant, or trained with differential privacy parameter $\varepsilon\approx 0$ (see Section 6.2). If the model is independent of the data, we expect the estimates of overall and subgroup vulnerabilities, as well as disparity, to all be zero in expectation. We refer to any violation of this property as null-model bias. We are interested not only in whether a method exhibits such bias, but also in whether this bias is correlated with subgroups.

5.2.0.2 Dataset

To have control over the distributions of subgroups and their representation, we create a synthetic dataset. We assume that the examples have binary class labels $y\in\{0,1\}$ , and belong to one of two subgroups $z\in\{C,T\}$ . We generate the examples from the multivariate normal distributions:

[TABLE]

where $\mathbf{1}^{d}$ is a $d$-vector of all ones, and the covariance matrix $\Sigma$ is generated such that $\|\Sigma\|_{\max}\leq 1$. We use $d=100$ dimensions, and set $\Pr[y=1]=1/2$. See Fig. 3 for an illustration.

To reflect that some subgroups can be harder to learn than others, the distributions are designed in such a way that the subgroup $z=C$ is more separable and hence more easily learnable than the subgroup $z=T$ . In our experiments we use the subgroup $z=C$ as the control (or majority) subgroup with fixed number of representatives in the data, and $z=T$ as the treatment (or minority) subgroup whose size we vary.

5.2.0.3 Setup

To see if the potential null-model bias depends on the sizes of subgroups, we generate multiple synthetic datasets such that each contains data belonging to two subgroups: control and treatment. The control subgroup has 1000 representatives in each dataset; the size of the treatment subgroup varies between 25 and 1000, with 8 distinct values. We run 8 experiments with different subgroup proportions. Within each experiment, we train 200 target models on freshly generated datasets. We set the target training algorithm to output the same classifier for any input training dataset. Recall that, because the models are independent of the input, we expect all vulnerability estimates to be zero on average. We estimate disparity using the three attacks described above, and run t-tests to determine whether the estimates are statistically significant, as explained in Section 5.1.
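The null-model test can be reproduced in miniature with a toy NumPy simulation (not the paper's experiment). We emulate a data-independent target by drawing “in” and “out” losses from the same distribution (an arbitrary exponential, our assumption), so the true vulnerability is zero, and compare the average estimates of the two threshold attacks for a small and a large subgroup. The function names are hypothetical.

```python
import numpy as np


def optimal_threshold_advantage(l_in, l_out):
    """max over tau of TPR - FPR for the rule "in if loss <= tau", with tau chosen
    on the same in/out losses it is evaluated on (the source of the bias)."""
    taus = np.concatenate([l_in, l_out])
    tpr = np.searchsorted(np.sort(l_in), taus, side="right") / len(l_in)
    fpr = np.searchsorted(np.sort(l_out), taus, side="right") / len(l_out)
    return max(0.0, float(np.max(tpr - fpr)))


def average_threshold_advantage(l_in, l_out):
    """TPR - FPR for the rule "in if loss < mean training loss"."""
    tau = l_in.mean()
    return float(np.mean(l_in < tau) - np.mean(l_out < tau))


def null_model_bias(subgroup_size, r=200, seed=0):
    """Average both attacks' estimates over r data-independent target models."""
    rng = np.random.default_rng(seed)
    opt, avg = [], []
    for _ in range(r):
        # Data-independent model: "in" and "out" losses share one distribution,
        # so the true (subgroup) vulnerability is zero.
        l_in = rng.exponential(scale=1.0, size=subgroup_size)
        l_out = rng.exponential(scale=1.0, size=subgroup_size)
        opt.append(optimal_threshold_advantage(l_in, l_out))
        avg.append(average_threshold_advantage(l_in, l_out))
    return float(np.mean(opt)), float(np.mean(avg))


for size in (25, 1000):
    opt_mean, avg_mean = null_model_bias(size)
    print(f"subgroup size {size}: optimal-threshold {opt_mean:.3f}, "
          f"average-threshold {avg_mean:+.3f}")
```

In this toy setting the average-threshold estimates stay near zero, whereas the optimal-threshold estimates are positive and larger for the smaller subgroup; the difference between the two subgroup estimates is a spurious disparity, since neither subgroup is actually vulnerable.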

5.2.0.4 Results on our synthetic dataset

In Fig. 4, we can see that the estimates of disparity produced with the shadow-model attack and the average-threshold attack are centered around zero, with the statistical tests confirming no significant difference from zero. The estimates coming from the optimal-threshold attack, however, are highly biased compared to the other attacks, as the estimates are consistently and significantly ($p<0.001$) different from zero. The bias is always positive (i.e., the attack overestimates disparity) and grows as the size of the treatment subgroup decreases. As the target models are independent of their training data and thus cannot have disparate vulnerability, we conclude that the use of the optimal-threshold attack results in significant null-model bias that grows as the subgroup size gets smaller.

5.2.0.5 Results on the dataset by Chang and Shokri [6]

To verify that our results are not artifacts of our specific synthetic data setup, we also reproduce the data setup used by Chang and Shokri to evaluate their subgroup-aware optimal-threshold attack. In their setup, they have one fixed dataset containing four subgroups that we denote as “0-0”, “0-1”, “1-0”, “1-1”, where the first number indicates the simulated demographic group and the second number the class $y$ (we refer to the original work [6] for details). The subgroups have 50, 450, 1000, and 1000 examples, respectively, with a total dataset size of 2500 examples. Following Chang and Shokri, we randomly subsample training datasets of size 1250 from the full dataset, and train one model on each. As before, we “train” a data-independent model. In this experiment, we only use threshold attacks due to the small size of the dataset (see Section 7 for more details). We use the ANOVA F-test as described in Section 5.1 to determine whether any of the subgroups have differing subgroup vulnerabilities.

Fig. 5 shows that significant null-model bias of the optimal-threshold attack also appears on this dataset (F-test $p<0.001$). In particular, the estimated subgroup vulnerability of the smallest subgroup “0-0”, with 50 examples, is 4%. At the same time, the estimates from the average-threshold attack are centered around 0 and do not significantly differ (F-test $p\approx 0.1$), suggesting no null-model bias.

This bias, however, should not affect the conclusions by Chang and Shokri [6]. Rather than directly using the estimates of subgroup vulnerability, their analysis used differences in estimates of subgroup vulnerability between two models (a “fair” and a “regular” model). In their particular scenario, the bias introduced by the estimation should be cancelled out in the final difference. Although the conclusions of Chang and Shokri should not be affected by the bias, estimation methods such as the optimal-threshold attack should be avoided when evaluating disparate vulnerability in general.

5.2.0.6 Biased estimator in a prior version

A pre-print version of our work (https://arxiv.org/abs/1906.00389v2) used a vulnerability estimation method that, like the optimal-threshold attack, leveraged information about the training dataset of the target model. This estimator was therefore biased, and so were the numerical results of that version.

5.2.0.7 Takeaways

Biased estimators of vulnerability can result in consistent overestimation of disparity if the bias correlates with subgroup parameters. The shadow-model attack does not have such bias as it does not have access to any information about a specific target. Interestingly, the average-threshold attack, despite using an additional piece of knowledge that goes beyond our adversarial model, also does not exhibit such bias. In contrast, the optimal-threshold attack produces significantly biased estimates for small groups.

Our results show the need to evaluate bias of the estimation method when measuring disparate vulnerability. To this end, we proposed to measure null-model bias, which detects bias when the worst-case vulnerability is zero. This test does not preclude a method from having bias if the worst-case vulnerability is larger. However, in practice MIA vulnerability has been shown to be relatively low.

5.3 Does Disparate Vulnerability Exist in ML Models?

Having established suitable methods for measuring disparate vulnerability, we apply them in a synthetic setup, and show that disparate vulnerability does arise in practice.

5.3.0.1 Setup

To capture the effect of subgroup size in the training data, we create several experiments with different subgroup proportions. Within each experiment, we sample 200 dataset pairs $S_{i}$ and $\bar{S}_{i}$ from our data distribution. In each dataset, the size of the control subgroup is fixed at 2500, and we vary the size of the treatment subgroup between experiments: 100, 500, 1000, and 2500. We estimate subgroup vulnerabilities using the subgroup-aware shadow-model attack (see Section 5.2), because, having no information about a specific target, it does not have null-model bias. As before, we use $W=(\ell(A_{S},X),Z)$ as adversary’s features. To train shadow models, we independently sample 30 fresh datasets from our data distribution. We use t-tests to determine whether measured disparity is statistically significant as described in Section 5.1.

5.3.0.2 Targets

We evaluate the following model families: logistic regression, and two ReLU neural networks with one hidden layer containing 8 and 32 neurons, respectively. We use the scikit-learn library [34] to train these models. All our models attain close to 100% test accuracy in our synthetic data setup.

5.3.0.3 Results

The results in Fig. 6 show that ML models can exhibit disparate vulnerability, even on a simple dataset. For all treatment sizes and targets, our estimates of disparity are significant ($p<0.001$), with the exception of the logistic regression when the treatment subgroup is relatively well-represented (500 to 2500 examples). We also see that the sample size of the subgroup plays an important role in disparate vulnerability: the less represented a group is in the training data, the higher its disparate vulnerability compared to a better-represented group. Even though the sample size seems to be the dominant effect, we observe small but significant disparate vulnerability even when the subgroups are equally represented in training.

6 Mitigating Disparate Vulnerability

We now study whether some existing methods for addressing privacy and fairness in ML prevent disparate vulnerability.

6.1 Fairness Constraints

Due to the dependency of disparate vulnerability on the disparate behavior of the model across subgroups, minimizing the between-subgroup discrepancy in any given property, such as the model’s outputs or loss [11], could intuitively decrease disparate vulnerability.

Formally, let us denote by $\mathsf{gap}^{\pi}$ the total-variation distance between distributions of some property of a model $\pi(A_{S},x)$ on examples coming from two subgroups $z$ and $z^{\prime}$ :

[TABLE]

With an appropriate choice of the property function, certain notions of algorithmic fairness can be seen as equivalent to, or upper bounding, the above gap. For example, if we choose the model property to be its outputs, $\pi(A_{S},x)=A_{S}(x)$, we obtain demographic parity [14]. Similarly, choosing the 0-1 correctness property of the model, $\pi(A_{S},x)=\mathds{1}[A_{S}(x)=y(x)]$, gives us accuracy equality [4].

In practice, a notion of fairness is satisfied on the training dataset rather than the whole data distribution. To capture this, we define an in-training gap as follows:

[TABLE]
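These gaps can be estimated with a minimal sketch for a single trained model and a discrete property, using plug-in empirical distributions (our simplification) and a held-out sample as a stand-in for the data distribution. For binary predicted labels, the total-variation gap reduces to the demographic-parity difference $|\Pr[\hat{Y}=1\mid z]-\Pr[\hat{Y}=1\mid z^{\prime}]|$. The helper name is hypothetical.

```python
from collections import Counter


def property_gap(values, groups, z, z_prime):
    """Total-variation distance between the empirical distributions of a discrete
    model property (e.g. predicted label) on subgroups z and z'."""
    vz = [v for v, g in zip(values, groups) if g == z]
    vzp = [v for v, g in zip(values, groups) if g == z_prime]
    p, q = Counter(vz), Counter(vzp)
    return 0.5 * sum(abs(p[k] / len(vz) - q[k] / len(vzp)) for k in set(p) | set(q))


# Predicted labels (property pi = model output) on training and held-out data.
train_pred, train_z = [1, 0, 1, 0], ["a", "a", "b", "b"]
test_pred, test_z = [1, 1, 1, 0], ["a", "a", "b", "b"]

gap_train = property_gap(train_pred, train_z, "a", "b")  # in-training gap: 0.0
gap_test = property_gap(test_pred, test_z, "a", "b")     # held-out proxy: 0.5
print(gap_train, gap_test, abs(gap_test - gap_train))    # 0.0 0.5 0.5

# Accuracy equality instead uses the correctness indicator as the property.
test_y = [1, 0, 1, 1]
correct = [int(p == y) for p, y in zip(test_pred, test_y)]
print(property_gap(correct, test_z, "a", "b"))  # 0.0
```

The difference between the held-out and in-training gaps is an empirical proxy for the fairness-generalization term $\delta$ used in Proposition 4.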

The following proposition establishes that, if the in-training gap is bounded and the model generalizes its fairness condition well, then disparity in vulnerability is bounded for adversaries that use the property addressed by the fairness notion:

Proposition 4.

Suppose a subgroup-aware adversary uses features $(W,Z)$ , and the following two conditions are satisfied:

1. Fairness on the training data: $\mathsf{gap}^{\phi_{W}}_{S}\leq\gamma$.

2. Fairness generalization: $|\mathsf{gap}^{\phi_{W}}-\mathsf{gap}^{\phi_{W}}_{S}|\leq\delta$.

Then, the magnitude of disparity in worst-case vulnerability is bounded as follows:

[TABLE]

Proof of Proposition 4.

First, observe that a combination of the two conditions implies:

[TABLE]

By this implication and the triangle inequality for total variation, we have that:

[TABLE]

Applying the triangle inequality to the underlined term:

[TABLE]

Combining the two,

[TABLE]

Implying:

[TABLE]

If we apply the previous steps analogously we can also obtain:

[TABLE]

Thus,

[TABLE]

Combining the inequalities, we get:

[TABLE]

By Corollary 3, we obtain the sought bound. ∎

We note that these guarantees only apply to adversaries targeting the features addressed by the implemented fairness notion. In other words, just as in the algorithmic-fairness literature, where no single fairness measure is appropriate in a general context [17], no single fairness measure can provide guarantees for bounding disparate vulnerability against any adversary.

6.1.1 Empirical Evaluation

6.1.1.1 Fairness notions

To validate the theoretical results, we estimate vulnerability of models that satisfy two algorithmic-fairness notions. First, demographic parity [14], which ensures that distributions of model outputs between demographic subgroups are close: $\mathsf{gap}^{\phi_{\hat{Y}}}\approx 0$. Second, equalized odds, which ensures that true-positive rates and false-positive rates between the subgroups are close [19]. We choose these notions as they are common in the literature, and there exist efficient algorithms and tooling for producing classifiers that satisfy them. To train the classifiers, we use the threshold post-processing approach [19] from the fairlearn library [5], applied to a logistic regression classifier.

6.1.1.2 Setup

Within the setup of Section 5.3, we run the following two experiments:

E1

We fulfill the requirements of Proposition 4. For this, we estimate vulnerability using features equalized by demographic parity: $W=(\hat{Y},Z)$. By Proposition 4, we expect low disparity in vulnerability for both classifiers as long as they generalize their fairness property well. In Appendix A, we show that in our data setup equalized odds implies demographic parity; thus, the theoretical guarantee also applies to equalized odds.

E2

We estimate vulnerability using adversary’s features $W=\left(\ell(A_{S},X),Z\right)$ which do not match what the fairness property does, so the requirements of Proposition 4 are not fulfilled.

We find that with 100 dimensions in our data setup, the threshold-optimization algorithm produces models that classify the data with 100% accuracy and no vulnerability. Thus, to demonstrate a setting where disparate vulnerability arises, we deviate from the parameters of Section 5.3 and we use the synthetic dataset with 10 dimensions.

6.1.1.3 Results

We present the results in Fig. 7. For E1, we see that demographic parity decreases disparate vulnerability compared to standard logistic regression. This empirically confirms Proposition 4. For E2, as expected, neither equalized odds nor demographic parity completely prevents disparate vulnerability. Yet, both decrease its magnitude by $3\times$ compared to the standard logistic regression.

In our particular setup, the constrained models do not perform worse than the unconstrained models. In general, however, fairness notions can be inherently at odds with accuracy [44].

6.2 Differentially Private Training

In this section, we look at how learning with differential privacy [13] relates to disparity in vulnerability. We use the basic notion of differential privacy:

Definition 6.

Training algorithm $A$ satisfies $\varepsilon$ -differential privacy (DP) if for any two datasets $S,S^{\prime}$ differing by the records of one individual, for any set of models $T$ :

$$\Pr[A(S)\in T]\leq e^{\varepsilon}\,\Pr[A(S^{\prime})\in T].$$

DP training limits the contribution of any individual in the dataset to the model training. Thus, DP should decrease vulnerability to MIAs. In particular, Yeom et al. [43], Chatzikokolakis et al. [7], and Humphries et al. [21] showed that the advantage of an MIA adversary is bounded by DP in the setting of the MIA game. For example:

Proposition 5 (Adapted from Yeom et al. [43]).

If the training algorithm satisfies $\varepsilon$ -DP, the worst-case vulnerability with any adversary’s features $W$ is bounded:

[TABLE]
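For a sense of scale, the sketch below evaluates the $e^{\varepsilon}-1$ bound that Yeom et al. [43] prove for $\varepsilon$-DP training at the privacy levels used in Section 6.2.1. We assume Proposition 5 keeps this form; since an advantage cannot exceed 1, any value of at least 1 is vacuous. The function name is hypothetical.

```python
import math


def yeom_dp_bound(eps):
    """Upper bound e^eps - 1 on membership advantage for eps-DP training
    (Yeom et al.); values >= 1 carry no information."""
    return math.exp(eps) - 1


for eps in (0.1, 1, 2, 10):
    bound = yeom_dp_bound(eps)
    status = "vacuous" if bound >= 1 else "informative"
    print(f"eps={eps}: bound={bound:.3f} ({status})")
```

Under this form, only $\varepsilon=0.1$ among those levels yields a bound below 1 ($e^{0.1}-1\approx 0.105$).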

These guarantees extend to disparate vulnerability:

Proposition 6.

If the training algorithm satisfies $\varepsilon$ -DP, the worst-case subgroup vulnerability of any $z$ , as well as magnitude of vulnerability disparity between any subgroups $z$ and $z^{\prime}$ , are uniformly bounded for any adversary’s features $W$ :

[TABLE]

Proof.

Observe that the following probability distributions are equivalent:

[TABLE]

Notice that datasets $S^{\prime}\cup\{x\}$ and $S^{\prime}\cup\{x^{\prime}\}$ differ by the records of at most one individual. Therefore, for any fixed dataset $S^{\prime}$ , the post-processing property of differential privacy applies:

[TABLE]

Taking expectation over $S^{\prime}$ of both sides, we obtain:

[TABLE]

By equivalence in Eq. 14:

[TABLE]

To get the bound on subgroup vulnerability, recall that by Proposition 3 it is upper bounded by the total variation. Thus, for any set of feature values $T$ :

[TABLE]

Applying Corollary 2, we also get the bound on disparity. ∎

6.2.1 Empirical Evaluation

To study how DP affects disparate vulnerability we train DP models with different privacy levels. As target models, we use DP logistic regression with private empirical risk minimization [8], trained using the diffprivlib [20] implementation. We use a min-max scaler, and provide a maximum row norm estimated on a separate sample from the data distribution. We use privacy levels $\varepsilon=0.1,1,2,10$ .

We see in Fig. 8 that, for all evaluated values of $\varepsilon$ , DP training considerably reduces disparity compared to the non-private logistic regression, with statistical tests not detecting significant deviations from 0.

On the downside, unlike training with fairness constraints, DP training results in a significant decrease in the accuracy of the models: a drop of between $5$ and $45$ percentage points, depending on the value of $\varepsilon$.

6.3 Takeaways

Fairness only bounds disparate vulnerability in certain scenarios. Even when the classifier’s fairness property generalizes beyond the training set, the bound is restricted to the adversarial strategy covered by the chosen fairness notion. Covering one adversarial strategy, however, is a weak security guarantee: the model could be (disparately) vulnerable to other strategies. Moreover, it is known that different fairness constraints are at odds with each other [17]. Hence, a model protected by one fairness notion may be inherently insecure against adversaries exploiting non-protected features.

Differential privacy bounds disparate vulnerability. We show that DP provides an upper bound on the vulnerability of all individuals and subgroups, and therefore on disparate vulnerability too. On the flip side, because DP guarantees are often at odds with accuracy, in practical applications $\varepsilon$ is usually set high, allowing for a lot of variation within the upper bound of Proposition 6. In practice, the particular approach to DP training that we evaluated mitigated disparity even with a weak privacy level ($\varepsilon=10$) that results in vacuous theoretical bounds, but at a significant accuracy cost.

7 Evaluation using Real-World Datasets

To investigate if we can detect disparate vulnerability in a realistic setting, we use the following two datasets as case studies:

–

adult dataset [25]. The dataset contains 48,842 examples from the 1994 Census database (https://archive.ics.uci.edu/ml/datasets/adult). The task is to determine whether a yearly salary is over or under $50K. It contains attributes such as age, sex, education, race, native country, etc. After one-hot encoding, the dataset contains 91 features. We use the race column as the subgroup attribute.

–

texas-50K dataset. We create this dataset based on the 2013 Texas Hospital Discharge data (https://www.dshs.texas.gov/THCIC/Hospitals/Download.shtm). As our evaluation setup is computationally expensive, to accommodate the same training algorithms as used in the synthetic data experiments, we randomly subsample 50,000 examples and reduce the number of features for training. We use the following columns: type of admission, illness severity, mortality risk, principal diagnosis code (out of more than 6000 codes, we only keep the top 1000 and create one separate code for the rest), length of stay, and patient’s demographic attributes: sex, race, ethnicity. After one-hot encoding, we have 1025 features. We use the race column as the subgroup attribute. As a task, analogously to the adult dataset, we use prediction of whether the total amount of charges is greater than a threshold (e.g., for health-insurance risk-scoring). As the threshold, we pick the median total charges on the subsampled dataset.
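The two preprocessing steps described for texas-50K (bucketing rare diagnosis codes and thresholding charges at the median) can be sketched as follows. The helper names are hypothetical, we assume “top 1000” means the most frequent codes, and the raw column names of the Texas data are not reproduced here.

```python
import statistics
from collections import Counter


def bucket_rare_codes(codes, top_k=1000, other="OTHER"):
    """Keep the top_k most frequent codes and map all remaining codes to `other`."""
    keep = {code for code, _ in Counter(codes).most_common(top_k)}
    return [code if code in keep else other for code in codes]


def charges_labels(total_charges):
    """Binary task label: 1 if total charges exceed the median of the sample."""
    threshold = statistics.median(total_charges)
    return [int(c > threshold) for c in total_charges], threshold


print(bucket_rare_codes(["A", "B", "A", "C", "A", "B", "D"], top_k=2))
# ['A', 'B', 'A', 'OTHER', 'A', 'B', 'OTHER']
print(charges_labels([100, 300, 200, 400]))  # ([0, 1, 0, 1], 250.0)
```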

Table 1 provides details about the subgroups.

7.0.0.1 Target models

We consider as target models logistic regression and neural networks with 8 and 32 neurons in the hidden layer (Section 5.3), logistic regression with fairness constraints (Section 6.1), and differentially private logistic regression with $\varepsilon$ values 1, 2, and 10 (Section 6.2). All the models beat the random-guessing accuracy baseline on the tasks.

7.0.0.2 Estimation method

As opposed to our synthetic data setup in which datasets to train shadow models can be directly sampled from the data-generating distribution, when real data is involved we can only sample data from the available finite dataset. We split the dataset in two parts: one for training of the shadow models, and one for evaluation of vulnerability [39]. As a result, the amount of available training data is greatly reduced, in particular, for minority subgroups that already have few representatives in the dataset. To avoid this problem, in this section we use the average-threshold attack for vulnerability estimation, which does not require training shadow models. Our evaluation in Section 5.2 showed that this attack is not null-model biased.

7.0.0.3 Setup

To train each target model, we randomly subsample 50% of the dataset to use for training ( $S_{i}$ ), and hold out the remaining data ( $\bar{S}_{i}$ ). We train 200 models for each model family on different splits of the dataset. For our statistical tests (see Section 5.1), we use $\alpha=0.01$ as significance level.

7.0.0.4 Results

We summarize the results in Table 2. As in our synthetic experiments, we observe evidence of disparity in neural networks. Importantly, the results show that low vulnerability in absolute terms does not imply absence of disparity. On adult, the 8-neuron network shows relatively low $0.4\%$ vulnerability but statistically significant disparity $(p<10^{-4})$ . Interestingly, on texas-50K, we also see statistical evidence of disparate vulnerability for logistic regression with demographic-parity constraints, although its overall vulnerability of 1.46% is comparable to standard logistic regression.

For the models with F-test $p<0.01$, we conduct post-hoc follow-up tests to see which particular pairs of subgroups have high disparity (we defer the detailed results of the post-hoc tests to Appendix B). On adult, consistently with our synthetic experiments, the smaller subgroups “Asian-Pac-Islander” (AI, 1,302 examples) and “Other” (OT, 353 examples) exhibit disparity between themselves and other, more populous subgroups. On texas-50K, almost all subgroup pairs exhibit significant disparity for the 32-neuron network.

The results for the logistic regression with fairness constraints are unlike the synthetic experiments. As opposed to a minority subgroup, as in the previous results, disparity appears between the most populous subgroup “4” (31,514 examples) and subgroups “2”, “3” and “5”. This disparity does not exist in the standard logistic regression. Thus, this result shows that fairness constraints can introduce disparity when the conditions of Proposition 4 are not met.

7.0.0.5 Discussion

We have used binary classification tasks for compatibility with the fairness definitions, but we expect disparity to be more pronounced in multi-class settings. As detailed in Section 4.4, disparate vulnerability is bound to happen whenever a model does not faithfully learn the distributional properties of the data for some subgroups. Prior research suggests it is likely to appear when the task has many features, or many classes in the case of classification [37].

We also only considered relatively small datasets. On the one hand, larger datasets enable models to learn better, thus decreasing both vulnerability and disparate vulnerability. On the other hand, they would enable the adversary to use shadow-model attacks, which could outperform the average-threshold attack used in our experiments.
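For concreteness, a loss-threshold attack of this kind, and its per-subgroup accuracy, can be sketched as follows. This is a minimal sketch under stated assumptions: the threshold is taken to be the mean training loss (in the spirit of the loss-threshold attack of Yeom et al.), the adversary is assumed to know that mean, and vulnerability is approximated by attack accuracy on a balanced set of members and non-members. The helper names are hypothetical and the exact attack configuration used in our experiments may differ.

```python
from collections import defaultdict


def average_threshold_attack(train_losses, losses):
    """Predict 'member' (1) for records whose loss is at most the mean training loss."""
    threshold = sum(train_losses) / len(train_losses)
    return [1 if loss <= threshold else 0 for loss in losses]


def subgroup_accuracy(preds, membership, subgroups):
    """Attack accuracy overall and per subgroup.

    Assumes the evaluation set is balanced between members (1) and
    non-members (0), so accuracy is a reasonable proxy for vulnerability.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, member, z in zip(preds, membership, subgroups):
        correct[z] += int(pred == member)
        total[z] += 1
    per_group = {z: correct[z] / total[z] for z in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group
```

With toy losses where members of subgroup "a" have low loss and members of subgroup "b" do not, the attack is perfect on "a" and no better than chance on "b", even though the overall accuracy is 0.75: the aggregate hides the gap.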

We leave for future work the investigation of how the number of classes and the dataset size affect disparate vulnerability.

8 Conclusions

We have provided the first formal analysis of the disparate vulnerability of population subgroups to membership inference attacks. Our analysis provides new insights into why and when vulnerability to MIAs arises and why and when these attacks have disparate impact.

8.0.0.1 Key takeaways

The first key takeaway of our study is that fully preventing MIAs, and thus disparate vulnerability, can be done in only two ways: either by significantly increasing the complexity of the learning problem to ensure distributional generalization, or by using a differentially private training algorithm, with the associated hit on performance.

The second takeaway surfaces a more general problem: the unreliability of privacy estimation for demographic groups with minority representation in the data, and its consequences. We show that for small subgroups it is easy to misestimate their protection, either indirectly, via aggregate privacy measures, or directly, when biases in the estimation are not adequately accounted for.

8.0.0.2 Why disparate vulnerability is important

Disparate vulnerability has crucial legal and policy significance. Companies moving data between organizations or across borders face frictions designed to protect fundamental rights, established by the approximately 140 countries around the world with privacy regulations that are largely similar, both conceptually and textually [18]. For example, moving data from Europe into a country with a significant state surveillance apparatus, such as the United States, is difficult after the European Court of Justice’s judgement in Schrems II. Other countries, such as several in South Asia, have established specific personal data localization laws [3]. As a consequence, there is growing interest in replacing the direct trade in personal data with various forms of trade in models trained on this data.

Yet the vulnerability of models to MIAs or other attacks compromising confidentiality might, in some situations, qualify the models themselves as personal data [42]. The accountability principle in European data protection law places the onus on data controllers to demonstrate that a model should not be classified this way, for example through privacy-estimation techniques. Our study indicates a real risk of “privacy-washing”: laundering a model with aggregate statistics that mask the vulnerability of subgroups. Prior work has also indicated that aggregate analysis can hide vulnerability to MIAs that target structurally vulnerable records [29]. However, such leakage appears easier to dismiss as an acceptable residual risk than disparate risks concerning members of salient minority groups: in a liberal democracy, a regulator is more accountable to such groups than to a socially arbitrary selection of persons.

8.0.0.3 Open challenges

Our results also uncover a new challenge. It is difficult for auditors or regulators to inspect disparate vulnerability in practice, because they might lack a sufficient number of examples from a minority group. When subgroup data is scarce, our methods could be underpowered to detect disparity; however, forgoing the statistical tests and unbiased estimation methods from Section 5 risks flagging disparity whenever subgroups differ, devaluing the meaning of the estimate.
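How scarce data limits detection can be made concrete with textbook normal approximations. The sketch below computes the half-width of a 95% confidence interval for an attack accuracy estimated on $n$ records, and the smallest accuracy gap between two subgroups that a two-sided two-proportion z-test at level 0.05 would detect with about 80% power. These formulas are generic approximations chosen for illustration, not the estimators of Section 5; the 30,000-record comparison group below is hypothetical, and treating all 353 “Other” records as evaluation data is an assumption.

```python
import math


def ci_halfwidth(p, n, z=1.96):
    """Half-width of a normal-approximation 95% CI for an accuracy p estimated on n records."""
    return z * math.sqrt(p * (1 - p) / n)


def min_detectable_gap(p, n1, n2, z_alpha=1.96, z_beta=0.8416):
    """Approximate smallest accuracy gap that a two-sided two-proportion z-test
    (level 0.05) detects with about 80% power, assuming a common accuracy p."""
    return (z_alpha + z_beta) * math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
```

For example, at an attack accuracy near $p=0.5$, `ci_halfwidth(0.5, 353)` is about 0.05, and `min_detectable_gap(0.5, 353, 30_000)` is about 0.075: a disparity of several percentage points could go undetected for a subgroup of this size, no matter how large the other subgroup is.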

This points to a need for theoretical results that can serve as a foundation in practical regulatory contexts. Such results may help regulators better ascertain the limits of any metrics presented to them, and the conditions under which a model is structurally likely to be vulnerable to different types of privacy attacks, even without difficult-to-obtain empirical evidence. The initial results provided in this paper can already contribute significantly to discussions around the classification of machine learning systems by their risk of data leakage, as business practices of using models to transport information continue to evolve.

Acknowledgements

The authors would like to thank Maksym Andriushchenko and Simon Oya for the helpful feedback and discussions. We also thank Hongyan Chang and Reza Shokri for clarifying their use of risk estimates [6].

This work was partially funded by the Swiss National Science Foundation with grant 200021-188824.

Appendix A Proofs

In this section we provide the omitted proofs.

A.1 Regular vs. Subgroup-Aware Vulnerability

A.1.0.1 Proposition 1

$V(\mathcal{A}^{*}_{W,Z})\geq V(\mathcal{A}^{*}_{W})$.

Proof of Proposition 1.

Recall that the Bayes adversary uses a Bayes-optimal classifier that maximizes the success probability (i.e., vulnerability) among all the possible classifiers. That is, for the regular and subgroup-aware adversaries, we have respectively:

[TABLE]

Let $F=\{f\mid f=g\circ h,\ h(w,z)=w,\ g:\mathbb{W}\to\{0,1\}\}$; that is, $F$ is the set of functions $f:\mathbb{W}\times\mathbb{Z}\to\{0,1\}$ that first reduce the tuple $(w,z)$ to $w$ and then apply a function $g$ to the remaining input. Clearly, $F\subseteq\{f\mid f:\mathbb{W}\times\mathbb{Z}\to\{0,1\}\}$.

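Then, to prove this proposition it suffices to observe that the regular adversary is equivalent to a subgroup-aware one restricted to the set of functions $F$. Since a maximum over a subset of functions cannot exceed the maximum over the full set, the success probability of the regular Bayes adversary is at most that of the subgroup-aware one.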

[TABLE]

∎
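The restriction argument can be checked by brute force on a tiny example. This minimal sketch assumes, as in the proof, that vulnerability is the success probability $\Pr[f(\cdot)=M]$ of the adversary's classifier, and uses a hypothetical joint distribution over binary $W$, $Z$, and membership bit $M$ in which the feature value that signals membership differs between the two subgroups. Enumerating all functions is exponential, so this only works for toy domains.

```python
from itertools import product

# Hypothetical toy joint distribution P(w, z, m); probabilities sum to 1.
P = {
    (0, 0, 1): 0.25, (1, 0, 0): 0.25,  # subgroup 0: members have w = 0
    (1, 1, 1): 0.25, (0, 1, 0): 0.25,  # subgroup 1: members have w = 1
}
W_VALUES = (0, 1)
Z_VALUES = (0, 1)


def success(f):
    """Pr[f(W, Z) = M] under P, i.e., the success probability of adversary f."""
    return sum(p for (w, z, m), p in P.items() if f(w, z) == m)


def best_regular():
    """Maximize over all g: W -> {0,1}, applied as f(w, z) = g(w) (the set F)."""
    best = 0.0
    for outputs in product((0, 1), repeat=len(W_VALUES)):
        table = dict(zip(W_VALUES, outputs))
        best = max(best, success(lambda w, z: table[w]))
    return best


def best_subgroup_aware():
    """Maximize over all f: W x Z -> {0,1}."""
    cells = list(product(W_VALUES, Z_VALUES))
    best = 0.0
    for outputs in product((0, 1), repeat=len(cells)):
        table = dict(zip(cells, outputs))
        best = max(best, success(lambda w, z: table[(w, z)]))
    return best
```

Here the regular Bayes adversary reaches only 0.5 (no better than guessing), while the subgroup-aware one reaches 1.0, so the inequality of Proposition 1 can be strict.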

A.2 Subgroup Vulnerability

To prove Proposition 3, we use the following statement:

Proposition 7.

For any two discrete probability measures $\mu$ and $\mu^{\prime}$ the following holds:

[TABLE]

Proof.

First, observe:

[TABLE]

Rearranging and grouping the terms, we get:

[TABLE]

∎

Proof of Proposition 3.

We provide a proof for the case of discrete features $W$. The proof is analogous in the case of absolutely continuous $W$. Note that for discrete measures $\mu$ and $\mu^{\prime}$, $d_{\mathrm{TV}}(\mu,\mu^{\prime})=\frac{1}{2}\|\mu-\mu^{\prime}\|_{1}$.

For convenience, let us define feature gaps as follows:

[TABLE]

The adversary’s success for a subgroup can be written in the following form, which is useful for our proof:

[TABLE]

First, suppose that $Z\notin W$. Consider the following set:

[TABLE]

For a given $z$, the set $C$ is the union of two disjoint sets $A$ and $B$, $C=A\cup B$:

[TABLE]

Thus, the sum in Eq. 15 can be decomposed into $\sum_{A}\mathsf{gap}_{z}(w)+\sum_{B}\mathsf{gap}_{z}(w)$, where

[TABLE]

The last equality is by Proposition 7. Applying this bound to Eq. 15, we obtain the desired Eq. 5.

Second, suppose that $Z\in W$. Let $w=(\cdots,z^{\prime})$. If $z^{\prime}\neq z$, then $\mathsf{gap}_{z}(w)=0$, and so we only need to consider the case $z^{\prime}=z$. In this case:

[TABLE]

After plugging this into Eq. 15, we obtain the equality in Eq. 6 by Proposition 7. ∎
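The identity $d_{\mathrm{TV}}(\mu,\mu^{\prime})=\frac{1}{2}\|\mu-\mu^{\prime}\|_{1}$ used in this proof, together with its equivalent form as the total positive gap $\sum_{w:\mu(w)>\mu^{\prime}(w)}(\mu(w)-\mu^{\prime}(w))$, can be checked numerically. This minimal sketch uses exact rational arithmetic, hypothetical measures, and brute force over all events; it illustrates the standard identity rather than reproducing the statement of Proposition 7.

```python
from fractions import Fraction
from itertools import chain, combinations


def tv_by_events(mu, nu):
    """d_TV as the largest gap |mu(S) - nu(S)| over all events S (brute force)."""
    support = range(len(mu))
    events = chain.from_iterable(combinations(support, k) for k in range(len(mu) + 1))
    return max(abs(sum(mu[i] for i in S) - sum(nu[i] for i in S)) for S in events)


def tv_by_l1(mu, nu):
    """d_TV as half the L1 distance, valid for discrete measures."""
    return sum(abs(a - b) for a, b in zip(mu, nu)) / 2


def tv_by_positive_gaps(mu, nu):
    """d_TV as the total excess mass of mu over nu on points where mu > nu."""
    return sum(a - b for a, b in zip(mu, nu) if a > b)
```

For $\mu=(1/2,1/4,1/4)$ and $\mu^{\prime}=(1/4,1/4,1/2)$, all three functions return exactly $1/4$; the maximizing event is the set of points where $\mu$ exceeds $\mu^{\prime}$.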

A.3 A Note on Equalized Odds vs. Demographic Parity

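Let $\hat{Y}$ denote the prediction of a binary classifier. With probabilities over the data distribution, the classifier satisfies equalized odds (EO) if:

$\Pr[\hat{Y}=1\mid Y=y,Z=z]=\Pr[\hat{Y}=1\mid Y=y,Z=z^{\prime}]$ for all $y\in\{0,1\}$ and all subgroups $z,z^{\prime}$.

In these terms, demographic parity is defined as the following requirement for a classifier:

$\Pr[\hat{Y}=1\mid Z=z]=\Pr[\hat{Y}=1\mid Z=z^{\prime}]$ for all subgroups $z,z^{\prime}$.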

In general, these two notions are not equivalent. In our synthetic data setup (Section 5.2), however, (a) the class distribution is the same across subgroups, $\Pr[Y\mid Z=z]=\Pr[Y\mid Z=z^{\prime}]$ for all subgroups $z,z^{\prime}$, and (b) the two classes are balanced, $\Pr[Y=1]=\Pr[Y=0]=1/2$. It is easy to see that in this case, EO implies demographic parity.
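The step behind “it is easy to see” is the law of total probability. Writing $\hat{Y}$ for the classifier's prediction (notation assumed here), for any two subgroups $z$ and $z^{\prime}$:

```latex
\begin{align*}
\Pr[\hat{Y}=1 \mid Z=z]
  &= \sum_{y\in\{0,1\}} \Pr[\hat{Y}=1 \mid Y=y, Z=z]\,\Pr[Y=y \mid Z=z] \\
  &= \sum_{y\in\{0,1\}} \Pr[\hat{Y}=1 \mid Y=y, Z=z^{\prime}]\,\Pr[Y=y \mid Z=z^{\prime}] \\
  &= \Pr[\hat{Y}=1 \mid Z=z^{\prime}],
\end{align*}
```

where the second line replaces the first factor using EO and the second factor using condition (a). This particular step relies only on condition (a).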

Appendix B Additional Tables

The rest of the appendix contains additional tables.

Bibliography


[1] Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. Differential privacy has disparate impact on model accuracy. In Annual Conference on Neural Information Processing Systems, NeurIPS, 2019.
[2] Solon Barocas and Andrew D. Selbst. Big data’s disparate impact. Calif. L. Rev., 2016.
[3] Arindrajit Basu, Elonnai Hickok, and Aditya Singh Chawala. The Localisation Gambit: Unpacking Policy Measures for Sovereign Control of Data in India. Centre for Internet and Society, India, 2019.
[4] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, 2018.
[5] Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz, Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen Walker. Fairlearn: A toolkit for assessing and improving fairness in AI. Technical Report MSR-TR-2020-32, Microsoft, May 2020. URL https://www.microsoft.com/en-us/research/publication/fairlearn-a-toolkit-for-assessing-and-improving-fairness-in-ai/.
[6] Hongyan Chang and Reza Shokri. On the privacy risks of algorithmic fairness. IEEE European Symposium on Security and Privacy, EuroS&P, 2021.
[7] Konstantinos Chatzikokolakis, Giovanni Cherubin, Catuscia Palamidessi, and Carmela Troncoso. The Bayes security measure. arXiv preprint arXiv:2011.03396, 2020.
[8] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 2011.

