On Privacy Protection of Latent Dirichlet Allocation Model Training

Fangyuan Zhao; Xuebin Ren; Shusen Yang; Xinyu Yang

arXiv:1906.01178·cs.LG·July 2, 2019

TL;DR

This paper investigates privacy risks in LDA model training and proposes privacy-preserving algorithms, including a privacy monitoring method and a locally private training algorithm, validated by experiments on real datasets.

Contribution

It introduces novel privacy-preserving algorithms for LDA training, addressing both inherent randomness and local differential privacy in crowdsourced data.

Findings

01. The inherent randomness of CGS provides some privacy guarantees.

02. The locally private LDA algorithm achieves differential privacy for individual data contributors.

03. Experimental results confirm the effectiveness of the proposed privacy-preserving methods.

Tables

Table 1: Details about the real-world datasets

Dataset	# words	# training docs	# test docs
KOS	209169	3000	430
NIPS	410753	1349	150
Enron	356363	8000	2000

Equations

$$p(z_{i}=k \mid z_{\neg i}, w) \propto \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)} \cdot \frac{n_{m}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m}^{k}+\alpha)}$$

$$E[\phi_{k}^{t} \mid z, w] = \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)}$$

$$\Pr[M(D)\in S] \leq e^{\varepsilon}\cdot\Pr[M(D^{\prime})\in S]$$

$$\Pr[f(t)=t^{*}] \leq e^{\varepsilon}\Pr[f(t^{\prime})=t^{*}]$$

$$p_{k} = e^{\frac{(2\Delta\ln p_{k})\ln p_{k}}{2\Delta\ln p_{k}}},$$

$$p_{k} \propto r_{k} = \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)} \cdot \frac{n_{m}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m}^{k}+\alpha)}$$

$$p_{k}^{\prime} \propto r_{k}^{\prime} = \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)-N_{k}} \cdot \frac{n_{m}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m}^{k}+\alpha)}$$

$$\varepsilon_{\gamma} = \max_{k\in\{1,2,...,K\}}\{2\xi_{k}\} = \max_{k\in\{1,2,...,K\}}\left\{2\ln\frac{p_{k}^{\prime}}{p_{k}}\right\}$$

$$\varepsilon_{\gamma} = 2\xi_{k} = 2\max\{\xi_{1},\xi_{2},...,\xi_{K}\} = 2\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}},$$

$$\ln\frac{\sum_{t=1}^{V}(n_{k}^{t}+\beta)}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)-N_{j}} < 2\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}}$$

$$\ln\frac{\sum_{t=1}^{V}(n_{k}^{t}+\beta)}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)-N_{k}} > 2\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}}$$

$$\varepsilon_{\gamma} = 2\max_{k\in\mathcal{T}}\left\{\ln\left(\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}}\cdot\frac{r_{k}}{r_{k}^{\prime}}\right)\right\} > 2\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}}$$

$$\mathcal{P} = \{\gamma \mid \exists k,\ \text{s.t.}\ N_{k}=N,\ N_{j}=0,\ \forall j\neq k\}$$

$$\sum_{\gamma^{*}} r_{k}^{\prime} = \max_{\gamma}\left\{\sum_{k} r_{k}^{\prime} \mid \gamma\in\Gamma\right\}$$

$$q_{k} = \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)-N} \cdot \frac{n_{m}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m}^{k}+\alpha)}$$

$$\varepsilon_{\gamma^{\prime}} = \max_{\gamma}\{\varepsilon \mid \gamma\in\Gamma\} = 2\ln\left(\frac{\sum_{j\neq k}r_{j}+q_{k}}{\sum_{j}r_{j}}\right)$$

$$\hat{\mathbf{V}}_{m}[j] = \begin{cases} \mathbf{V}_{m}[j], & \text{with probability } 1-f \\ 1, & \text{with probability } f/2 \\ 0, & \text{with probability } f/2 \end{cases}$$

$$\varepsilon = \ln\frac{\Pr(\hat{\mathbf{V}}_{m}[t]=1 \mid \mathbf{V}_{m}[t]=1)}{\Pr(\hat{\mathbf{V}}_{m}[t]=1 \mid \mathbf{V}_{m}[t]=0)} = \ln\frac{1-f/2}{f/2}.$$

$$\hat{N}_{t} = \frac{2n_{t}-fM}{2(1-f)}$$

$$D(\hat{N}_{t}) = \frac{(2-f)fM}{4(1-f)^{2}}.$$

$$\xi_{j} = \left|\ln\frac{p_{j}^{\prime}}{p_{j}}\right| = \left|\ln\left(\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}}\cdot\frac{\sum_{t}(n_{j}^{t}+\beta)-N_{j}}{\sum_{t}(n_{j}^{t}+\beta)}\right)\right|$$

$$\xi_{k}-\xi_{j}$$

$$-\left|\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}} + \ln\frac{\sum_{t}(n_{j}^{t}+\beta)-N_{j}}{\sum_{t}(n_{j}^{t}+\beta)}\right|$$

$$\left|\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}} + \ln\frac{\sum_{t}(n_{j}^{t}+\beta)-N_{k}}{\sum_{t}(n_{j}^{t}+\beta)}\right| < \ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}}$$

$$\frac{n_{k,\neg i}^{t}+\beta}{\sum_{t=1}^{V}(n_{k,\neg i}^{t}+\beta)-N_{k}} \cdot \frac{n_{m,\neg i}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m,\neg i}^{k}+\alpha)} \quad\text{by}\quad \frac{a_{k}}{b_{k}-N_{k}}$$

$$a_{k} = (n_{k,\neg i}^{t}+\beta)\cdot\frac{n_{m,\neg i}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m,\neg i}^{k}+\alpha)}, \qquad b_{k} = \sum_{t=1}^{V}(n_{k,\neg i}^{t}+\beta)$$

$$\sum_{j\neq k}\frac{a_{j}}{b_{j}} + \frac{a_{k}}{b_{k}-N} > \sum_{j=1}^{K}\frac{a_{j}}{b_{j}-N_{j}}$$

$$\frac{a_{k}}{b_{k}-N} = \max\left\{\frac{a_{j}}{b_{j}-N},\ j\in\{1,2,...,K\}\right\}$$

Full text

On Privacy Protection of Latent Dirichlet Allocation Model Training

Fangyuan Zhao (1,2,3), Xuebin Ren (1,2), Shusen Yang (2,3), Xinyu Yang (1,2)

(1) School of Computer Science and Technology, Xi’an Jiaotong University, China

(2) National Engineering Laboratory for Big Data Analytics, Xi’an Jiaotong University, China

(3) Ministry of Education Key Lab For Intelligent Networks and Network Security, Xi’an Jiaotong University, China

Abstract

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for discovering the hidden semantic architecture of text datasets, and it plays a fundamental role in many machine learning applications. However, like many other machine learning algorithms, the process of training an LDA model may leak sensitive information about the training datasets and bring significant privacy risks. To mitigate the privacy issues in LDA, we study privacy-preserving algorithms for LDA model training in this paper. In particular, we first develop a privacy monitoring algorithm to investigate the privacy guarantee obtained from the inherent randomness of the collapsed Gibbs sampling (CGS) process in a typical LDA training algorithm on centralized curated datasets. Then, we further propose a locally private LDA training algorithm on crowdsourced data to provide local differential privacy for individual data contributors. The experimental results on real-world datasets demonstrate the effectiveness of our proposed algorithms.

1 Introduction

Massive text data have arisen with the sustained and rapid development of the Internet. Mining and analyzing text data can yield a vast amount of knowledge and thus benefit society as a whole. As a fundamental model for text mining, Latent Dirichlet Allocation (LDA) Blei et al. (2003) can discover the main features of sparse text datasets by identifying their hidden semantic architecture. In particular, LDA maps high-dimensional text data to a low-dimensional topic space while retaining the implicit semantics, which makes it an effective machine learning technique for clustering and classification. Many enterprises, such as Yahoo Smola and Narayanamurthy (2010), Tencent Wang et al. (2014); Yut et al. (2017), and Microsoft Yuan et al. (2015), have built LDA platforms to support big data analysis and to train machine learning models on various text data.

Similar to other machine learning models, LDA may be trained on datasets that contain sensitive information about individuals, and it will inevitably memorize some knowledge about those datasets. Unfortunately, attacks that exploit this characteristic have been proposed to extract private information about the training data from machine learning models. For example, membership inference attacks (MIA) Shokri et al. (2017) can be launched to infer the membership information of an individual. Model inversion attacks Fredrikson et al. (2014) have been shown to be able to extract training data from observed model predictions. Therefore, despite its popularity and effectiveness, a naively trained LDA model may also suffer from these attacks and lead to great privacy risks.

Differential privacy (DP), proposed by Dwork et al. (2006), has become the de facto standard of privacy protection, with a rigorous mathematical proof. Due to its strong privacy guarantee, DP has been exploited in many fields such as data publication Ren et al. (2018); Li et al. (2019) and machine learning Chaudhuri et al. (2011); Abadi et al. (2016), as well as LDA training Park et al. (2016); Zhu et al. (2016). For example, Park et al. (2016) proposed to obtain a privacy guarantee for LDA models by perturbing the expected sufficient statistics in each iteration of the variational Bayesian method, which is a parameter estimation algorithm for LDA. Zhu et al. (2016) presented a differentially private LDA algorithm that perturbs the sampling distribution in the collapsed Gibbs sampling (CGS) process, which is a typical training algorithm for LDA.

Both of the above algorithms achieve DP by injecting extra noise into the training process of LDA on centralized training datasets. However, as a typical sampling algorithm with inherent randomness, CGS has uncertainty in its execution and naturally provides some level of privacy guarantee, as indicated in Wang et al. (2015); Foulds et al. (2016). In particular, Wang et al. (2015) proved that posterior sampling and stochastic gradient Markov chain Monte Carlo techniques possess some inherent privacy guarantee. Foulds et al. (2016) further extended this conclusion to general MCMC methods. Apart from not exploiting this inherent privacy, both existing algorithms assume that the LDA model is trained on centralized datasets owned by a trustworthy data curator. Nevertheless, due to privacy concerns, individual data contributors may be reluctant to share their sensitive data directly and may prefer to send locally sanitized data to the model trainer.

Therefore, aiming to provide a strong privacy guarantee for LDA model training, this paper not only investigates how to utilize the inherent privacy of CGS in LDA training on centralized datasets, but also proposes a locally private version of LDA that can be trained on crowdsourced datasets with local sanitization. The contributions are summarized as follows:

• We develop a privacy monitoring algorithm to measure the inherent privacy guarantee of the CGS algorithm in LDA. In particular, we first define two different levels of privacy, document level and word level, and present the corresponding lower bound of the privacy guarantee after a given number of iterations.

• We propose LP-LDA, a novel mechanism that supports training an LDA model on crowdsourced datasets with local sanitization, which provides a local privacy guarantee for individual data contributors.

• We conduct experiments on several real-world datasets to demonstrate the effectiveness of our proposed algorithms. In particular, the experimental results show that LP-LDA can achieve a high model training accuracy while providing a sufficient local privacy guarantee.

The rest of the paper is organized as follows. Section 2 reviews the preliminaries. Section 3 describes our algorithms in detail. The experiments are presented in Section 4. Finally, we conclude the paper in Section 5.

2 Preliminaries

2.1 LDA and Collapsed Gibbs Sampling

The LDA model was first proposed by Blei et al. (2003) for analyzing the implicit semantic architecture of a corpus. In the LDA model, any document $m$ in a corpus $D$ can be described by a distribution over $K$ latent topics, where each topic $k$ is represented by a distribution over all words. LDA assumes the following generative process for the documents in the corpus $D$:

1. For each topic $k$, draw a “topic-word” distribution $\phi_{k}$ on all words $t$ from Dirichlet($\beta\vec{1}$), where $\beta$ is the hyperparameter of the Dirichlet prior and can be interpreted as the prior observation for the “topic-word” count.

2. For each document $m$, draw a “document-topic” distribution $\theta_{m}$ from Dirichlet($\alpha\vec{1}$), where $\alpha$ is a hyperparameter similar to $\beta$ and represents the prior observation for the “document-topic” count.

3. For each word $w$ in a document $m$, first draw a topic $k$ from $\theta_{m}$, and then draw a word $t$ from $\phi_{k}$.

The essence of training an LDA model is to estimate the parameters $\phi_{k}$ for a given corpus $D$. Collapsed Gibbs sampling is an effective algorithm for this estimation. It iterates over each word $w_{i}$ and samples a new topic $z_{i}$ for $w_{i}$ from the full conditional distribution

$$p(z_{i}=k \mid z_{\neg i}, w) \propto \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)} \cdot \frac{n_{m}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m}^{k}+\alpha)} \tag{1}$$

where $\neg i$ denotes all words except word $w_{i}$, $n_{k}^{t}$ denotes the number of times topic $k$ is assigned to word $t$, and $n_{m}^{k}$ denotes the number of times topic $k$ appears in document $m$. These counts are maintained in the matrices $N_{k}^{t}$ and $N_{m}^{k}$, respectively.

After multiple rounds of sampling over the whole corpus, a topic sample is obtained for each word, and the parameter $\phi_{k}$ can be estimated by its posterior expectation

$$E[\phi_{k}^{t} \mid z, w] = \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)}$$

For the detailed procedure of CGS, see Heinrich (2005).
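The sampling loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`init_state`, `cgs_sweep`, `estimate_phi`), the random initialization, and the representation of documents as lists of word ids are all hypothetical choices made here. The document-side denominator in Equation (1) is the same for every topic, so it is dropped before normalization.

```python
import numpy as np

def init_state(docs, K, V, rng):
    """Randomly assign a topic to every word and build the count matrices."""
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    n_kt = np.zeros((K, V), dtype=np.int64)          # topic-word counts
    n_mk = np.zeros((len(docs), K), dtype=np.int64)  # document-topic counts
    for m, doc in enumerate(docs):
        for t, k in zip(doc, z[m]):
            n_kt[k, t] += 1
            n_mk[m, k] += 1
    return z, n_kt, n_mk

def cgs_sweep(docs, z, n_kt, n_mk, alpha, beta, rng):
    """One collapsed Gibbs sampling pass over every word of every document."""
    K, V = n_kt.shape
    for m, doc in enumerate(docs):
        for i, t in enumerate(doc):
            k_old = z[m][i]
            # Exclude the current word from the counts (the "not i" counts).
            n_kt[k_old, t] -= 1
            n_mk[m, k_old] -= 1
            # Unnormalized full conditional; the document-side denominator
            # is constant in k and cancels when normalizing.
            r = (n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta) * (n_mk[m] + alpha)
            k_new = rng.choice(K, p=r / r.sum())
            z[m][i] = k_new
            n_kt[k_new, t] += 1
            n_mk[m, k_new] += 1

def estimate_phi(n_kt, beta):
    """Posterior expectation of the topic-word distributions."""
    smoothed = n_kt + beta
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```

A typical use would call `init_state` once, run `cgs_sweep` for a fixed number of iterations, and then read the topics off `estimate_phi`; recomputing the row sums inside the loop keeps the sketch short at the cost of speed.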

2.2 Differential Privacy

Differential privacy (DP), proposed by Dwork et al. (2006), has become the de facto standard of privacy protection, with a rigorous mathematical proof. The rationale of the DP guarantee is that only negligible information can be gained by observing the output of a query on neighboring datasets.

Definition 1.

(Differential Privacy, Dwork et al. (2006)) A randomized mechanism $M:D\rightarrow Y$ is $\varepsilon$-differentially private if for any neighboring datasets $D,D^{\prime}$ satisfying $|D\Delta D^{\prime}|=1$ and any output set $S\subseteq Y$:

$$\Pr[M(D)\in S] \leq e^{\varepsilon}\cdot\Pr[M(D^{\prime})\in S]$$

2.3 Local Privacy

Differential privacy implicitly assumes a centralized dataset owned by a trustworthy curator and does not ensure the privacy guarantee for individual data contributors. Recently, local (differential) privacy has been proposed to provide data sanitization at the individual users’ side instead of the central server side.

Definition 2.

(Local Privacy, Dwork et al. (2014)) A randomized function $f$ satisfies $\varepsilon$-local privacy if and only if for any two input tuples $t$ and $t^{\prime}$ in the domain of $f$, and for any output $t^{*}$ of $f$:

$$\Pr[f(t)=t^{*}] \leq e^{\varepsilon}\Pr[f(t^{\prime})=t^{*}]$$

3 Our Approach

In this section, we first investigate the inherent privacy of the CGS process in LDA for a non-sanitized dataset owned by a trustworthy curator. Then, to complement this with a privacy guarantee during data acquisition, we present a locally private mechanism, LP-LDA, which trains an LDA model on a dataset that has been sanitized locally by the users.

3.1 Privacy monitoring Algorithm

3.1.1 Inherent Privacy of CGS

Generally, DP is achieved in most machine learning algorithms by introducing extra noise or randomness, which inevitably causes a utility loss in the trained model. However, it has been shown in Foulds et al. (2016) that some degree of inherent DP can be obtained from the Gibbs sampling algorithm for free. This is because each sampling step in Gibbs sampling works in the same way as an exponential mechanism, which is a classic method for achieving DP. As one version of Gibbs sampling, collapsed Gibbs sampling naturally inherits this property. Furthermore, this property can also provide privacy for free in the CGS-based LDA training process. Therefore, aiming to utilize the inherent privacy, we develop a privacy monitoring algorithm to quantify the privacy guarantee of CGS in the LDA training process. In particular, the rationale behind the privacy monitoring algorithm is to find an adequate exponential mechanism for each sampling step in CGS and then accumulate the total privacy guarantee of all exponential mechanisms according to the composition theorems of DP.

3.1.2 Document-level privacy and word-level privacy

This paper considers providing DP for the individual words and for the individual documents in the LDA training corpus, respectively.

Word-level privacy: Let $D=\{w_{1},w_{2},...,w_{W}\}$ denote a corpus with $|D|=W$ words $w_{i}$ $(i=1,2,...,W)$. Then, its neighboring dataset $D^{\prime}$ satisfying $|D\Delta D^{\prime}|=1$ differs from $D$ by a single word $w$. Word-level privacy prevents membership inference of individual words of the training corpus from the trained LDA model.

Document-level privacy: Let $D=\{m_{1},m_{2},...,m_{M}\}$ denote a corpus with $|D|=M$ documents $m_{i}$ $(i=1,2,...,M)$. Then, its neighboring dataset $D^{\prime}$ satisfying $|D\Delta D^{\prime}|=1$ differs from $D$ by a single document $m$. In order to bound the sensitivity, we assume that a single document includes at most $N_{max}$ words. Document-level privacy prevents re-identification of individual documents in the LDA training dataset, which may be contributed by and associated with individual users.

3.1.3 Inherent privacy in each sampling

To begin with, we show the essence of the intrinsic privacy guarantee in each sampling of CGS in terms of the exponential mechanism. Consider the sampling process for word $w_{i}$ in the $n$-th iteration. Suppose its sampling distribution on $K$ topics is given by $\mathbf{P}=(p_{1},p_{2},...,p_{K})^{\top}$, where $p_{k}$ denotes the probability that topic $k$ is assigned to $w_{i}$ in this sampling. Then we can rewrite $p_{k}$ as

$$p_{k} = e^{\frac{(2\Delta\ln p_{k})\ln p_{k}}{2\Delta\ln p_{k}}},$$

which can be understood as the output probability of an exponential mechanism $M_{E}(w_{i},u,\mathcal{K})$ that selects the topic $k\in\mathcal{K}$ with probability $p_{k}$. The utility function of $M_{E}(w_{i},u,\mathcal{K})$ is $u(w_{i},k)=\ln p_{k}$ and its sensitivity is $\Delta\ln p_{k}$. Matching the exponent with the standard form $\varepsilon u/(2\Delta u)$ of the exponential mechanism, $\varepsilon=2\Delta\ln p_{k}$ is the intrinsic privacy guarantee of $M_{E}(w_{i},u,\mathcal{K})$.

3.1.4 Privacy monitoring for each sampling

Unfortunately, it is intractable to compute the exact value of $2\Delta\ln p_{k}$ during the execution of the CGS algorithm in LDA because of the complicated structure of the training corpus. We therefore look for an upper bound of $2\Delta\ln p_{k}$ to quantify the privacy guarantee $\varepsilon$.

According to Equation (1), the sampling distribution $\mathbf{P}$ for word $w_{i}=t$ in $D$ in the $n$-th iteration can be computed by

$$p_{k} \propto r_{k} = \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)} \cdot \frac{n_{m}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m}^{k}+\alpha)}$$

Suppose that $\mathbf{P^{\prime}}=(p^{\prime}_{1},p^{\prime}_{2},...,p^{\prime}_{K})^{\top}$ is the corresponding distribution on $D^{\prime}$, a neighboring dataset of $D$. Then

$$p_{k}^{\prime} \propto r_{k}^{\prime} = \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)-N_{k}} \cdot \frac{n_{m}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m}^{k}+\alpha)}$$

where $N_{k}$ denotes the number of words in $D-D^{\prime}$ assigned to topic $k$, for $k\in\{1,2,...,K\}$. We refer to $\{N_{1},N_{2},...,N_{K}\}$ as a topic partition on $D-D^{\prime}$, with $\sum_{k} N_{k}=|D-D^{\prime}|$.

Given a topic partition $\gamma=\{N_{1},N_{2},...,N_{K}\}$, the privacy guarantee in this sampling process can be measured by

$$\varepsilon_{\gamma} = \max_{k\in\{1,2,...,K\}}\{2\xi_{k}\} = \max_{k\in\{1,2,...,K\}}\left\{2\ln\frac{p_{k}^{\prime}}{p_{k}}\right\}$$

where $\xi_{k}$ denotes the sensitivity of $\ln{p_{k}}$. However, there are $\binom{N+K-1}{K-1}$ partitions in total, where $N=|D-D^{\prime}|$, so it is computationally prohibitive to find the maximal $\varepsilon_{\gamma}$ among all partitions. In the following, we consider how to reduce the search space of partitions.
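To make the cost of the exhaustive search concrete, the sketch below evaluates $\varepsilon_{\gamma}$ for a single topic partition and counts how many partitions there are. It is an illustration only: the function names and the way the counts are passed in (`num`, `row_sum`, `doc_w`) are assumptions introduced here, and the absolute value of the log ratio is used so that the result is non-negative regardless of the direction of the change.

```python
from math import comb

import numpy as np

def epsilon_for_partition(num, row_sum, doc_w, partition):
    """Return max_k 2*|ln(p'_k / p_k)| for one topic partition.

    num[k]       : n_k^t + beta for the word being sampled
    row_sum[k]   : sum over the vocabulary of (n_k^t + beta)
    doc_w[k]     : (n_m^k + alpha) / sum_k (n_m^k + alpha)
    partition[k] : N_k, tokens of D - D' assigned to topic k
    """
    num, row_sum, doc_w, part = (
        np.asarray(a, dtype=float) for a in (num, row_sum, doc_w, partition)
    )
    if np.any(row_sum - part <= 0):
        raise ValueError("each topic must keep positive mass after removal")
    r = num / row_sum * doc_w                  # weights on D
    r_prime = num / (row_sum - part) * doc_w   # weights on the neighbor D'
    p = r / r.sum()
    p_prime = r_prime / r_prime.sum()
    return float(2 * np.max(np.abs(np.log(p_prime / p))))

def num_partitions(N, K):
    """Number of ways to split N removed tokens among K topics."""
    return comb(N + K - 1, K - 1)
```

Because `num_partitions` grows combinatorially in both arguments, calling `epsilon_for_partition` on every partition is only feasible for toy values of $N$ and $K$.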

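For simplicity, we first consider a special case in which there exists some topic $i$ with $N_{i}=0$ in a given partition.

Theorem 1.

Suppose that there exists some $N_{k}=0$ in a given partition $\gamma=\{N_{1},N_{2},...,N_{K}\}$. Then the privacy guarantee

$$\varepsilon_{\gamma} = 2\xi_{k} = 2\max\{\xi_{1},\xi_{2},...,\xi_{K}\} = 2\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}},$$

if and only if, for any $j\neq k$,

$$\ln\frac{\sum_{t=1}^{V}(n_{k}^{t}+\beta)}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)-N_{j}} < 2\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}} \tag{4}$$

Proof.

See Appendix A for details. ∎

Corollary 1.

Suppose that there exists a topic set $\mathcal{T}=\{k,...,j\}$ with $N_{j}\neq 0$ for all $j\in\mathcal{T}$ in a given partition $\gamma$, and that

$$\ln\frac{\sum_{t=1}^{V}(n_{k}^{t}+\beta)}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)-N_{k}} > 2\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}}$$

for some $k\in\mathcal{T}$. Then the privacy guarantee

$$\varepsilon_{\gamma} = 2\max_{k\in\mathcal{T}}\left\{\ln\left(\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}}\cdot\frac{r_{k}}{r_{k}^{\prime}}\right)\right\} > 2\ln\frac{\sum_{k}r_{k}^{\prime}}{\sum_{k}r_{k}} \tag{5}$$

Proof.

This follows from Theorem 1. ∎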

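Theorem 1 and Corollary 1 illustrate a special case for finding the privacy $\varepsilon_{\gamma}$. The following lemma and theorem further demonstrate that, among all the partitions, the one with the largest privacy guarantee belongs to the partition set $\mathcal{P}=\{\gamma \mid \exists k,\ \text{s.t.}\ N_{k}=N,\ N_{j}=0,\ \forall j\neq k\}$.

Lemma 1.

There exists a partition $\gamma^{*}$ in

$$\mathcal{P} = \{\gamma \mid \exists k,\ \text{s.t.}\ N_{k}=N,\ N_{j}=0,\ \forall j\neq k\}$$

such that

$$\sum_{\gamma^{*}} r_{k}^{\prime} = \max_{\gamma}\left\{\sum_{k} r_{k}^{\prime} \mid \gamma\in\Gamma\right\}$$

where $\Gamma$ denotes the set consisting of all the partitions.

Proof.

See Appendix B for details. ∎

Definition 3.

(Pseudo sampling distribution) Given a vector $\mathbf{q}$ of length $K$ whose components are

$$q_{k} = \frac{n_{k}^{t}+\beta}{\sum_{t=1}^{V}(n_{k}^{t}+\beta)-N} \cdot \frac{n_{m}^{k}+\alpha}{\sum_{k=1}^{K}(n_{m}^{k}+\alpha)}$$

$\mathbf{q}$ is called the pseudo sampling distribution in this sampling.

Theorem 2.

Among all the partitions, there must exist a partition $\gamma^{\prime}$ in

$$\mathcal{P} = \{\gamma \mid \exists k,\ \text{s.t.}\ N_{k}=N,\ N_{j}=0,\ \forall j\neq k\}$$

such that

$$\varepsilon_{\gamma^{\prime}} = \max_{\gamma}\{\varepsilon_{\gamma} \mid \gamma\in\Gamma\} = 2\ln\left(\frac{\sum_{j\neq k}r_{j}+q_{k}}{\sum_{j}r_{j}}\right) \tag{8}$$

if condition (4) holds, where $k$ is the topic index such that $|r_{k}-q_{k}|=\left\|\mathbf{r}-\mathbf{q}\right\|_{\infty}$ and $\mathbf{q}$ is the pseudo sampling distribution.

Proof.

See Appendix C for details. ∎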

Theorem 2 indicates that only the partitions in $\mathcal{P}$ need to be considered when computing the privacy $\varepsilon$ in each sampling process, which greatly reduces the search scope. In particular, if condition (4) holds for all partitions in $\mathcal{P}$, the privacy guarantee can be computed directly by Equation (8), which is the first case to consider. If not, for any partition $\gamma$ in $\mathcal{P}$ that does not satisfy condition (4), the privacy guarantee can be computed by Equation (5). Since $\gamma$ is arbitrary, this gives another $K$ cases to consider, one for each of the $K$ partitions in $\mathcal{P}$. Furthermore, since it is not known in advance whether condition (4) holds, all $K+1$ cases have to be enumerated to find the bound on the privacy guarantee. Algorithm 1 presents the search-based algorithm for monitoring the privacy guarantee of each sampling for each word.
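Algorithm 1 is not reproduced in this text, so the following is only a sketch of the reduced search it describes, under assumptions made here. Instead of branching on condition (4) and using the closed forms, it evaluates $\varepsilon_{\gamma}$ directly from its definition (with an absolute value on the log ratio) for each of the $K$ partitions that place all $N$ removed tokens in a single topic, and returns the largest value. The function name and argument layout are hypothetical.

```python
import numpy as np

def epsilon_over_concentrated(num, row_sum, doc_w, N):
    """Search only partitions with N_k = N for one topic k and N_j = 0 otherwise.

    num[k]     : n_k^t + beta for the word being sampled
    row_sum[k] : sum over the vocabulary of (n_k^t + beta)
    doc_w[k]   : (n_m^k + alpha) / sum_k (n_m^k + alpha)
    Returns (largest epsilon, topic index that attains it).
    """
    num, row_sum, doc_w = (np.asarray(a, dtype=float) for a in (num, row_sum, doc_w))
    if np.any(row_sum <= N):
        raise ValueError("each topic must keep positive mass after removing N tokens")
    r = num / row_sum * doc_w
    q = num / (row_sum - N) * doc_w   # pseudo sampling distribution of Definition 3
    p = r / r.sum()
    best_eps, best_k = -1.0, -1
    for k in range(len(r)):
        r_prime = r.copy()
        r_prime[k] = q[k]             # only topic k loses tokens in this partition
        p_prime = r_prime / r_prime.sum()
        eps = float(2 * np.max(np.abs(np.log(p_prime / p))))
        if eps > best_eps:
            best_eps, best_k = eps, k
    return best_eps, best_k
```

The loop costs $K$ evaluations per sampled word, compared with $\binom{N+K-1}{K-1}$ for the exhaustive search over all partitions.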

3.1.5 Privacy monitoring for LDA

So far, the privacy guarantee $\varepsilon_{w}^{i}$ of the sampling process for word $w$ in the $i$-th iteration can be measured by Algorithm 1. Since CGS samples each word repeatedly across iterations, and within an iteration visits the words of the corpus one after another, the total privacy guarantee of the whole CGS process in LDA training can be computed according to the composition theorems of DP.

Theorem 3.

Given a corpus $D$, suppose the CGS algorithm performed on word $w$ at the $i$-th iteration satisfies $\varepsilon_{w}^{i}$-DP. Then, after $n$ iterations, the whole CGS algorithm performed on $D$ satisfies $\max_{w}\{\sum_{i=1}^{n}\varepsilon_{w}^{i}\}$-DP.

Proof.

For any word $w$, after $n$ iterations of sampling, the CGS process has accessed it $n$ times. By the sequential composition theorem Li et al. (2016), the total privacy guarantee for word $w$ in the CGS algorithm is $\varepsilon_{w}=\sum_{i=1}^{n}\varepsilon_{w}^{i}$. Moreover, according to Equation (1), each iteration of CGS in LDA accesses each word only once to perform the sampling. By the parallel composition theorem Li et al. (2016), the total privacy guarantee for the corpus (all words) is therefore the maximum privacy guarantee of CGS among all words, that is, $\max_{w}\{\sum_{i=1}^{n}\varepsilon_{w}^{i}\}$. ∎

Based on this observation, Algorithm 2 shows the privacy monitoring algorithm for the whole CGS process in LDA.
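The accumulation rule of Theorem 3 is easy to state in code. The sketch below is not Algorithm 2; it only assumes that the per-sampling bounds have already been collected in an array `eps` with one row per iteration and one column per word (a layout chosen here for illustration).

```python
import numpy as np

def total_epsilon(eps):
    """Combine per-sampling bounds eps[i, w] into one guarantee for the corpus.

    Sequential composition: sum over iterations i for each word w.
    Parallel composition:   take the maximum over words.
    """
    eps = np.asarray(eps, dtype=float)
    per_word = eps.sum(axis=0)
    return float(per_word.max())

def running_epsilon(eps):
    """Guarantee after each iteration, useful for monitoring during training."""
    eps = np.asarray(eps, dtype=float)
    return np.cumsum(eps, axis=0).max(axis=1)
```

`running_epsilon` returns a non-decreasing sequence, so training could in principle be stopped once a chosen privacy budget is reached; that stopping rule is a possible use, not something stated in the text.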

3.2 LP-LDA

As analyzed above, the CGS algorithm can intrinsically guarantee the privacy of individual documents for an LDA model trained on a plain-text dataset that is owned by a trustworthy curator. However, in many distributed applications, data servers cannot always be trusted with private data, and data owners may not be willing to contribute their sensitive data directly. For this case, we further propose a hidden-data based LDA mechanism, LP-LDA, that performs the training process on a sanitized dataset with local privacy. In particular, the LP-LDA mechanism consists of two main components: local perturbation at the user side and training on a reconstructed dataset at the server side.

3.2.1 Local perturbation

The local perturbation at the user side includes the following steps:

• Step 1. Each document $m$ is encoded as a binary vector $\mathbf{V}_{m}$, in which each bit $\mathbf{V}_{m}[j]$ represents the presence of the $j$-th word in the word bag of the corpus.

• Step 2. Each bit $\mathbf{V}_{m}[j]$ of the binary vector $\mathbf{V}_{m}$ is then randomly flipped according to the following randomized response rule:

$$\hat{\mathbf{V}}_{m}[j] = \begin{cases} \mathbf{V}_{m}[j], & \text{with probability } 1-f \\ 1, & \text{with probability } f/2 \\ 0, & \text{with probability } f/2 \end{cases}$$

where $f\in[0,1]$ is a parameter that specifies the randomness of flipping and adjusts the local privacy level.

• Step 3. Each user then sends the noisy binary vector $\hat{\mathbf{V}}_{m}$ to the central server. Because $\hat{\mathbf{V}}_{m}$ is sanitized locally, it can be shared without exposing the original vector $\mathbf{V}_{m}$.
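The three user-side steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of word ids as vector positions, and the single uniform draw per bit are choices made here.

```python
import numpy as np

def encode(doc_word_ids, V):
    """Step 1: binary presence vector over a vocabulary of size V."""
    v = np.zeros(V, dtype=np.int8)
    v[np.unique(np.asarray(doc_word_ids, dtype=np.int64))] = 1
    return v

def perturb(v, f, rng):
    """Step 2: randomized response applied independently to every bit.

    With probability 1 - f the bit is kept, with probability f/2 it is
    set to 1, and with probability f/2 it is set to 0.
    """
    u = rng.random(v.shape)
    out = v.copy()
    out[u < f / 2] = 1
    out[(u >= f / 2) & (u < f)] = 0
    return out

def local_report(doc_word_ids, V, f, rng):
    """Steps 1-3: the vector a user would send to the server."""
    return perturb(encode(doc_word_ids, V), f, rng)
```

Setting `f = 0` returns the original vector unchanged, while `f = 1` replaces every bit with a fair coin flip, which matches the two extremes of the privacy parameter.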

3.2.2 Training on reconstructed dataset

After receiving the flipped binary vectors from a large number of data contributors, the central server can aggregate the vectors, reconstruct the dataset and then perform training on the reconstructed dataset. The rationale behind this is that the training result of topic-word distribution is insensitive to the document partitions and only depends on the total word counts in the corpus.

• Step 1. For each bit position $t$ in the noisy binary vectors, the server counts the number of 1s as $n_{t}=\sum_{m=1}^{M}\hat{\mathbf{V}}_{m}[t]$.

• Step 2. The server then estimates the true count $N_{t}$ of each bit in the original binary vectors $\mathbf{V}_{m}$ as $\hat{N}_{t}=(2n_{t}-fM)/(2(1-f))$.

• Step 3. For each bit, the server computes the difference $\delta_{t}=\hat{N}_{t}-n_{t}$.

• Step 4. For each bit $t$, if $\delta_{t}>0$, the server randomly samples $\delta_{t}$ binary vectors whose $t$-th bit is $0$ and sets that bit to $1$; if $\delta_{t}<0$, the server randomly samples $|\delta_{t}|$ binary vectors whose $t$-th bit is $1$ and sets that bit to $0$; otherwise, it keeps the noisy bit vectors as received.

• Step 5. Based on the adjusted bit vectors, the server reconstructs a dataset and performs the CGS process on it.
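Steps 1 to 4 on the server side can be sketched as below. Two details are assumptions added here because the text does not specify them: the estimate $\hat{N}_{t}$ is rounded to the nearest integer and clipped to $[0, M]$ so that a whole number of bits can be flipped, and $f<1$ is required to avoid division by zero. The function name and return values are hypothetical.

```python
import numpy as np

def reconstruct(noisy, f, rng):
    """Adjust the noisy bit matrix so each column sum matches its estimate.

    noisy : (M, V) array of 0/1 bits, one row per contributed document
    f     : randomized response parameter, assumed to satisfy 0 <= f < 1
    Returns (adjusted matrix, raw estimates of the true counts).
    """
    noisy = np.asarray(noisy)
    M, V = noisy.shape
    n = noisy.sum(axis=0)                              # Step 1: observed counts
    est = (2 * n - f * M) / (2 * (1 - f))              # Step 2: estimated counts
    # Assumption: round and clip so the target is a feasible integer count.
    target = np.clip(np.rint(est), 0, M).astype(np.int64)
    out = noisy.copy()
    for t in range(V):
        delta = target[t] - n[t]                       # Step 3: difference
        if delta > 0:                                  # Step 4: flip 0 -> 1
            zeros = np.flatnonzero(out[:, t] == 0)
            out[rng.choice(zeros, size=delta, replace=False), t] = 1
        elif delta < 0:                                # Step 4: flip 1 -> 0
            ones = np.flatnonzero(out[:, t] == 1)
            out[rng.choice(ones, size=-delta, replace=False), t] = 0
    return out, est
```

The rows of the adjusted matrix can then be turned back into bags of word ids and passed to a CGS trainer, which corresponds to Step 5.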

3.2.3 Privacy Analysis of LP-LDA

Theorem 4.

LP-LDA satisfies $\varepsilon$-differential privacy for each document contributor, where $\varepsilon=\ln\frac{1-f/2}{f/2}$.

Proof.

Suppose a word $t$ appears in a noisy bit vector. The probability that it was kept from the original bit vector is $Pr(\hat{\mathbf{V}}_{m}[t]=1|\mathbf{V}_{m}[t]=1)=1-f/2$, and the probability that it was flipped from the original bit vector is $Pr(\hat{\mathbf{V}}_{m}[t]=1|\mathbf{V}_{m}[t]=0)=f/2$. Then, according to the definition of DP, the mechanism guarantees the privacy of

$$\varepsilon = \ln\frac{\Pr(\hat{\mathbf{V}}_{m}[t]=1 \mid \mathbf{V}_{m}[t]=1)}{\Pr(\hat{\mathbf{V}}_{m}[t]=1 \mid \mathbf{V}_{m}[t]=0)} = \ln\frac{1-f/2}{f/2}.$$

The analysis also holds for any bit $t$ with $\hat{\mathbf{V}}_{m}[t]=0$. ∎

Since the reconstruction and training processes are essentially post-processing of the noisy bit vectors, the local privacy guarantee remains unchanged for all the documents.
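As a quick numerical aid, the relation between the flipping parameter $f$ and the privacy parameter $\varepsilon$ in Theorem 4 can be evaluated in both directions. The inverse formula $f=2/(1+e^{\varepsilon})$ is obtained here by solving $\varepsilon=\ln\frac{1-f/2}{f/2}$ for $f$; the function names are hypothetical.

```python
import math

def epsilon_from_f(f):
    """Privacy parameter of Theorem 4 for a flipping parameter 0 < f <= 1."""
    if not 0 < f <= 1:
        raise ValueError("f must lie in (0, 1]")
    return math.log((1 - f / 2) / (f / 2))

def f_from_epsilon(eps):
    """Flipping parameter that yields a given eps >= 0 (inverse of the above)."""
    if eps < 0:
        raise ValueError("eps must be non-negative")
    return 2 / (1 + math.exp(eps))
```

For instance, `epsilon_from_f(1.0)` is 0, since every bit is replaced by a fair coin flip, and smaller values of `f` give larger `eps`, that is, a weaker guarantee.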

3.2.4 Utility Analysis of LP-LDA

Theorem 5.

Let $N_{t}$ and $n_{t}$ denote the counts of word $t$ in the original and perturbed datasets, respectively. Then

$$\hat{N}_{t} = \frac{2n_{t}-fM}{2(1-f)}$$

is an unbiased estimator of $N_{t}$ with variance

$$D(\hat{N}_{t}) = \frac{(2-f)fM}{4(1-f)^{2}}.$$

Proof.

Let $n_{1}$ denote the count of word $t$ retained from the real dataset and $n_{2}$ denote the noisy part. Then $n_{1}$ and $n_{2}$ follow two binomial distributions, i.e., $n_{1}\sim B(N_{t},1-f/2)$ and $n_{2}\sim B(M-N_{t},f/2)$. Let $X=n_{1}+n_{2}$. Its first theoretical moment is $E(X)=N_{t}(1-f/2)+(M-N_{t})\cdot(f/2)$ and its first sample moment is $\bar{X}=n_{t}$. Therefore,

$$\hat{N}_{t} = \frac{2n_{t}-fM}{2(1-f)}$$

is the moment estimator as well as an unbiased estimator. Since $n_{1}$ and $n_{2}$ are independent, $D(n_{t})=M\cdot(f/2)(1-f/2)$, and the variance of the estimator is then

$$D(\hat{N}_{t}) = \frac{D(n_{t})}{(1-f)^{2}} = \frac{(2-f)fM}{4(1-f)^{2}}.$$

∎
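Theorem 5 can be checked empirically with a small Monte Carlo simulation. The sketch below draws the retained and noisy parts from the two binomial distributions used in the proof and compares the sample mean and variance of the estimator with the stated values; the function names and the example parameters are assumptions made for illustration.

```python
import numpy as np

def estimate_count(n_t, f, M):
    """Estimator of the true count N_t from the observed noisy count n_t."""
    return (2 * n_t - f * M) / (2 * (1 - f))

def theoretical_variance(f, M):
    """Variance of the estimator as stated in Theorem 5."""
    return (2 - f) * f * M / (4 * (1 - f) ** 2)

def simulate_estimates(N_t, M, f, trials, rng):
    """Draw noisy counts as in the proof and return the resulting estimates."""
    kept = rng.binomial(N_t, 1 - f / 2, size=trials)    # true 1s that stay 1
    noise = rng.binomial(M - N_t, f / 2, size=trials)   # true 0s that become 1
    return estimate_count(kept + noise, f, M)
```

With, say, `M = 1000`, `N_t = 300` and `f = 0.5`, the mean of `simulate_estimates` should be close to 300 and its variance close to `theoretical_variance(0.5, 1000)`, up to sampling error.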

4 Experiment

In this section, we evaluate the effectiveness of our proposed privacy monitoring algorithm and locally private LDA algorithm LP-LDA on real-world datasets.

The datasets used in our experiments are: KOS (http://archive.ics.uci.edu/ml/), which contains 3430 blog entries from the dailykos website; NIPS (http://nips.djvuzone.org/txt.html), which contains 1740 research papers from the NIPS conference; and Enron (www.cs.cmu.edu/~enron), which contains 0.5 million email messages from about 150 users.

We extracted part of each dataset as the training set and used the rest as the test set. For simplicity, we pre-processed these datasets before running our experiments. For example, all stop words were removed and the 1000 most frequent words in each dataset were chosen as the corresponding vocabulary list. Details about these datasets after pre-processing can be found in Table 1.

In our experiments, for all datasets, the number of topics is set to $50$ and the maximum number of CGS iterations in LDA model training is set to $300$, which is sufficient for convergence on all three datasets. The hyperparameters $\alpha$ and $\beta$ are set to 0.1 and 0.01, respectively.

4.1 Inherent privacy of CGS in LDA

Figure 1 illustrates the inherent privacy guarantee of the CGS algorithm in LDA, as measured by our proposed privacy monitoring algorithm on the three datasets, for both document-level and word-level privacy. Note that a larger privacy parameter $\varepsilon$ in the figures means a weaker privacy guarantee.

As both subfigures show, the word-level and document-level privacy parameters $\varepsilon$ of CGS in LDA increase approximately linearly with the number of sampling iterations. This is because the privacy bounds of the individual sampling iterations are very close to one another, and the total privacy parameter accumulates over the iterations according to the sequential composition theorem.

Although CGS obtains a privacy guarantee for free on all datasets, the inherent privacy varies across datasets. At the document level, the privacy guarantee achieved on NIPS is the weakest, while that on Enron is the strongest. This is because the documents in NIPS contain the most words on average, which makes them the most difficult to hide effectively. For word-level privacy, the LDA model trained on NIPS has the strongest privacy guarantee because NIPS contains the largest number of words, so the sampling probability for each unique word is the lowest. In contrast, with the vocabulary size fixed, KOS contains the fewest words in total and yields the weakest word-level privacy after the same number of iterations.
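The near-linear growth follows directly from sequential composition, under which per-iteration privacy parameters add up. A minimal sketch, where the function name and the per-iteration value of 0.5 are purely illustrative and not measurements from the paper:

```python
from itertools import accumulate


def accumulated_epsilon(per_iteration_eps):
    """Running total of epsilon under sequential composition."""
    return list(accumulate(per_iteration_eps))


if __name__ == "__main__":
    # Four iterations with the same hypothetical per-iteration bound.
    print(accumulated_epsilon([0.5] * 4))
```

When every iteration contributes roughly the same bound, the running total is a straight line in the iteration count, which matches the trend described for Figure 1.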

4.2 Local mechanism

Figure 2 depicts the simulation performance of our proposed LP-LDA mechanism at different privacy levels. The flipping probability $f$ in LP-LDA varies from $0.5$ to $0.001$, and the corresponding privacy level $\varepsilon$ varies from $1.0986$ to $7.6004$. The utility of LDA model training is measured by the perplexity on the test sets. Perplexity is an information-theoretic measure commonly used to evaluate the prediction performance of an LDA model; in general, a smaller perplexity on a test set means better prediction accuracy. In particular, we compare LP-LDA with a baseline privacy-preserving LDA mechanism based on the Laplace mechanism, in which the sufficient statistics of the likelihood, i.e., the word count matrices $N_{k}^{t}$ and $N_{m}^{k}$, are privatized at the beginning of the CGS algorithm with a sensitivity of $1$ and a privacy level of $\varepsilon$ (Foulds et al., 2016).

As shown, the perplexity of both LP-LDA and the baseline algorithm decreases as $\varepsilon$ increases, which reflects the trade-off between privacy and utility. In the stronger privacy regime with smaller $\varepsilon$, the perplexity of LP-LDA is larger than that of the baseline algorithm. This is because the Laplace mechanism in the baseline algorithm introduces less noise into the word count statistics $N_{t}$ than the randomized response in LP-LDA. In the weaker privacy regime with larger $\varepsilon$, however, the perplexity of LP-LDA is far lower than that of the baseline algorithm, indicating greater LDA model training utility. These utility comparison results can also be explained by the difference in the variance of the word count $N_{t}$ between the two mechanisms. In the baseline mechanism based on Laplace noise, the noise variance is $D(N^{\prime}_{t})=2K^{2}/\varepsilon^{2}$, while the variance $D(\hat{N}_{t})$ in our proposed LP-LDA is given in Equation (10). In particular, for larger $\varepsilon$ on all three datasets, we always have $D({\hat{N}_{t}})<D(N^{\prime}_{t})$.
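For reference, perplexity is commonly computed as the exponential of the negative average per-token log-likelihood on held-out text. The sketch below uses that standard definition; the exact formula used in the experiments is not spelled out in this section, and the function name and input format are assumptions.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of held-out tokens.

    token_log_probs: one log p(w) per test token, as assigned by the
    trained model (how these are obtained from LDA is not shown here).
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # A model that spreads probability uniformly over a 1000-word
    # vocabulary has perplexity equal to the vocabulary size.
    uniform = [math.log(1.0 / 1000)] * 50
    print(perplexity(uniform))
```

A model that predicts the test tokens better assigns them higher log-probabilities and therefore obtains a lower perplexity.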

5 Conclusion and future work

In this work, we investigate the privacy protection of LDA model training. We first show that the CGS algorithm in LDA possesses some inherent privacy in each sampling process, and then propose an efficient search-based privacy monitoring algorithm to identify the privacy guarantee bound in the iterative CGS process of LDA. In addition to training on a trustworthy data server, we also propose a locally private solution, LP-LDA, which trains LDA on a dataset sanitized by individual local users and is applicable to many scenarios. The experiments on real-world datasets validate our proposed approaches. Future work will center on finding a tighter bound on the inherent privacy guarantee in LDA model training.

6 Acknowledgments

This work is supported by National Natural Science Foundation of China under Grant 61572398, Grant 61772410, Grant 61802298, Grant 11690011 and Grant U1811461, in part by the Fundamental Research Funds for the Central Universities under Grant xjj2018237, in part by the China Postdoctoral Science Foundation under Grant 2017M623177, and in part by the National Key Research and Development Program of China under Grant 2017YFB1010004.

Appendix A Proof of Theorem 1

Proof.

The sensitivity obtained from topic $j$ can be computed as

[TABLE]

then we compare $\xi_{j}$ and $\xi_{k}$

[TABLE]

By condition (4), it is easy to prove that

[TABLE]

We can observe that $\xi_{k}=\xi_{j}$ when $N_{k}=N_{j}=0$, and $\xi_{k}>\xi_{j}$ when $N_{k}=0,N_{j}\neq 0$ and condition (11) holds. Hence, $\xi_{k}=\max\{\xi_{1},\xi_{2},\ldots,\xi_{K}\}$ holds due to the arbitrariness of $k$ and $j$. ∎

Appendix B Proof of Lemma 1

Proof.

For convenience, we denote

[TABLE]

where

[TABLE]

then the problem is transformed into proving that

[TABLE]

holds if

[TABLE]

Inequality (12) is equivalent to

[TABLE]

To prove inequality (13), we consider a function set

[TABLE]

where

[TABLE]

then each function in $\mathcal{Y}$ is determined by a pair of parameters $(a_{j},b_{j})$. Consider the relation between $(a_{j},b_{j})$ and $(a_{i},b_{i})$, where $i,j\in\{1,2,\ldots,K\}$; it must belong to one of the two cases below:

[TABLE]

Case 1: It must hold that $y_{i}(N_{i})+y_{j}(N_{j})<y_{i}(N_{i})+y_{i}(N_{j})<y_{i}(N_{i}+N_{j})$, since $y_{i}(0)=y_{j}(0)$ and $y_{i}^{\prime}(x)>y_{j}^{\prime}(x),\forall x>0$.

Case 2: It must hold that

[TABLE]

In fact, it is easy to prove that $y_{i}(x)$ and $y_{j}(x)$ have exactly one intersection in $(0,\min\{b_{i},b_{j}\})$, denoted by $(x^{*},y^{*})$. Based on this, the positions of $N_{i},N_{j},N_{i}+N_{j}$ on the number line give three cases to consider:

case 1: $N_{j}\leq x^{*},N_{i}+N_{j}>x^{*}$. By computing the derivatives of $y_{i}^{\prime}(x)$ and $y_{j}^{\prime}(x)$, we have

[TABLE]

case 2: $N_{j}>x^{*},N_{i}+N_{j}>x^{*}$. Since $y_{j}(N_{j})<y_{i}(N_{j})$ holds, then:

[TABLE]

case 3: $N_{i}+N_{j}\leq x^{*}$. Since $y_{i}(N_{i})<y_{j}(N_{i})$ holds, then:

[TABLE]

Through the above analysis of the properties of $y_{i}(x)$, we have proved that there must exist a function $y_{i}(x)$ such that $y_{i}(N)>\sum_{j=1}^{K}y_{j}(N_{j})$. This completes the proof of Lemma 1. ∎

Appendix C Proof of Theorem 2

Proof.

Given a partition $\gamma^{*}=\{N_{1},\ldots,N_{K}\}$ not in $\mathcal{P}$, it suffices to verify that there exist some partitions in $\mathcal{P}$ such that the privacy parameter obtained from $\gamma^{*}$ is smaller than the parameter obtained from those partitions. Assume that the privacy parameter from $\gamma^{*}$ is $\varepsilon=2\xi=2\max\{\xi_{1},\xi_{2},\ldots,\xi_{K}\}$; then there are two cases to consider:

[TABLE]

Case 1: Based on Lemma 1, there exists a partition $\gamma^{\prime}$ from $\mathcal{P}$ such that $\sum_{k}r_{k}^{\prime}=\max_{\gamma}\{\sum_{k}r_{k}^{\prime}|\gamma\}$, and since there exists $N_{j}=0$ in $\gamma^{\prime}$, according to Corollary 1

[TABLE]

Case 2: Based on Theorem 1, the elements of $\{\xi_{1},\xi_{2},\ldots,\xi_{K}\}$ satisfy

[TABLE]

Since $N_{j}\ll\sum_{t}n_{j}^{t}$ always holds, especially for a large corpus, it is not hard to deduce that

[TABLE]

So far, we have proved the existence of $\gamma^{\prime}$. According to Theorem 1, if condition (4) holds for each $k\in\{1,2,\ldots,K\}$ with $N_{k}=N$, then equation (8) holds directly. This completes the proof of Theorem 2. ∎

Bibliography

Abadi et al. [2016] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
Blei et al. [2003] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
Chaudhuri et al. [2011] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
Dwork et al. [2014] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
Foulds et al. [2016] James Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving bayesian data analysis. arXiv preprint arXiv:1603.07294, 2016.
Fredrikson et al. [2014] Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14), pages 17–32, 2014.
Heinrich [2005] Gregor Heinrich. Parameter estimation for text analysis. Technical report, 2005.
