On Differentially Private Online Predictions

Haim Kaplan; Yishay Mansour; Shay Moran; Kobbi Nissim; Uri Stemmer

arXiv:2302.14099·cs.LG·March 1, 2023


TL;DR

This paper introduces an interactive variant of joint differential privacy tailored for online processes, demonstrating its favorable properties and showing that private online learning can be achieved with only polynomial overhead in mistake bounds.

Contribution

It proposes a new interactive joint differential privacy definition and proves that it allows private online learning with polynomial mistake overhead, unlike previous more restrictive notions.

Findings

1. Interactive joint privacy satisfies group privacy, composition, and post-processing.

2. Any online learning rule can be privatized with polynomial mistake overhead.

3. This contrasts with previous, more restrictive notions, for which only a double exponential overhead is known.

Abstract

In this work we introduce an interactive variant of joint differential privacy towards handling online processes in which existing privacy definitions seem too restrictive. We study basic properties of this definition and demonstrate that it satisfies (suitable variants of) group privacy, composition, and post-processing. We then study the cost of interactive joint privacy in the basic setting of online classification. We show that any (possibly non-private) learning rule can be effectively transformed to a private learning rule with only a polynomial overhead in the mistake bound. This demonstrates a stark difference with more restrictive notions of privacy such as the one studied by Golowich and Livni (2021), where only a double exponential overhead on the mistake bound is known (via an information theoretic upper bound).

Equations

$$\mathcal{M}\bigl(\mathcal{A};S\bigr)=\sum_{t=1}^{T}1\{\hat{y}_{t}\neq y_{t}\},$$

$$\tilde{O}\left(\frac{d^{2}}{\varepsilon^{2}}\log^{2}\left(\frac{1}{\delta}\right)\log^{2}\left(\frac{T}{\beta}\right)\right)$$

$$\left|a_{i}-\sum_{j=1}^{i}x_{j}\right|\leq O\left(\frac{1}{\varepsilon}\log(T)\log\left(\frac{T}{\beta}\right)\right).$$

$$\varepsilon^{\prime}=\sqrt{2m\ln(1/\delta^{\prime})}\,\varepsilon+m\varepsilon(e^{\varepsilon}-1).$$

$$\texttt{OnlineGame}_{\mathcal{M},\mathcal{B},T,g}(0)\equiv W_{0}\quad\text{and}\quad\texttt{OnlineGame}_{\mathcal{M},\mathcal{B},T,g}(1)\equiv W_{g}.$$

$$P(\texttt{OnlineGame}_{\mathcal{M},\mathcal{B},T}(0))\approx_{(\varepsilon,\delta)}P(\texttt{OnlineGame}_{\mathcal{M},\mathcal{B},T}(1)).$$

$$P(\texttt{OnlineGame}_{\mathcal{M},\mathcal{B},T}(0))\equiv W_{\ell-1}\quad\text{and}\quad P(\texttt{OnlineGame}_{\mathcal{M},\mathcal{B},T}(1))\equiv W_{\ell},$$

$$\texttt{OnlineGame}_{\mathcal{A},\mathcal{B},T,g}(0)\equiv W_{0}\approx_{(\varepsilon,\delta)}W_{1}\approx_{(\varepsilon,\delta)}W_{2}\approx_{(\varepsilon,\delta)}\dots\approx_{(\varepsilon,\delta)}W_{g}\equiv\texttt{OnlineGame}_{\mathcal{A},\mathcal{B},T,g}(1).$$

$$\texttt{CATG}(b)\approx_{(\varepsilon,0)}\texttt{CATG-noCount}(b),$$

$$\texttt{CATG-AboveThreshold}(b)\approx_{(0,\delta)}\texttt{CATG-noCount}(b).$$

$$\texttt{CATG-AboveThreshold}(b)\approx_{(0,\delta)}\texttt{CATG-AboveThreshold}(b).$$

$$\approx_{(\varepsilon,0)}\texttt{CATG-noCount}(0)$$

$$\approx_{(0,\delta)}\texttt{CATG-AboveThreshold}(0)$$

$$\approx_{(\varepsilon,\delta)}\texttt{CATG-AboveThreshold}(1)$$

$$\approx_{(0,\delta)}\texttt{CATG-AboveThreshold}(1)$$

$$\approx_{(0,\delta)}\texttt{CATG-noCount}(1)$$

$$\approx_{(\varepsilon,0)}\texttt{CATG}(1)$$

$$\texttt{OnlineGame}_{\texttt{POP},\mathcal{B}}(0)$$

$$\approx_{(\varepsilon,\delta)}\texttt{ChallengeAT-Game}_{\hat{\mathcal{B}}|_{\mathcal{B}}}(1)$$

$$\approx_{(0,\delta)}\texttt{OnlineGame}_{\texttt{POP},\mathcal{B}}(1).$$

$$\text{error}_{\texttt{CAT}}=O\left(\frac{\Delta}{\varepsilon}\,r+\lambda\ln\left(\frac{r+\lambda}{\delta}\right)\log\left(\frac{T}{\beta}\right)\right),$$

$$\text{1/5-Err}=\left\{i\in[T]:\sum_{j\in[k]}\mathbb{1}\{\hat{y}_{i,j}\neq y_{i}\}>k/5\right\}.$$

$$\texttt{expertAdvance}=\left|\left\{i\in[T]:y_{i}\neq\hat{y}_{i,\ell_{i}}\right\}\right|.$$

$$\texttt{expertAdvance}\leq k\,d.$$

$$\Pr\left[\text{1/5-Err}>18\,d\,k+18+\ln\left(\frac{1}{\beta}\right)\right]\leq\beta.$$

$$k=\tilde{O}\left(\frac{d}{\varepsilon^{2}}\log^{2}\left(\frac{1}{\delta}\right)\log^{2}\left(\frac{T}{\beta}\right)+\frac{1}{\varepsilon\cdot d}\log(T)\log\left(\frac{T}{\delta}\right)\right),$$

$$\Pr\left[\texttt{CoinGame}_{\mathcal{B},k,m}>\lambda\right]\leq\exp\left(-\frac{\lambda}{6}+3(k+1)\right).$$

$$\sum_{i=1}^{T}|\hat{y}_{i}-y_{i}|\leq O\left(M^{*}+\operatorname{Ldim}(\mathcal{H})\ln(T)\right).$$

$$\mathrm{OPT}_{i^{*}}\triangleq\min_{h\in\mathcal{H}}\sum_{i=1}^{i^{*}}|h(x_{i})-y_{i}|>d\cdot\ln(T).$$

$$k\cdot\texttt{WorstExpert}\geq\text{1/5-Err}\geq\texttt{OurError}\geq u=\Omega(k\,d\ln(T)).$$

$$k\cdot\texttt{WorstExpert}\geq\text{1/5-Err}\geq\texttt{NumTop}\geq r=\Omega(k\,d\ln(T)).$$

$$\mathrm{OPT}\triangleq\min_{h\in\mathcal{H}}\sum_{i=1}^{T}|h(x_{i})-y_{i}|.$$



Full text

On Differentially Private Online Predictions

Haim Kaplan Tel Aviv University and Google Research. [email protected]. Partially supported by Israel Science Foundation (grant 1595/19), and the Blavatnik Family Foundation.

Yishay Mansour Tel Aviv University and Google research. [email protected]. Work partially funded from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 882396), by the Israel Science Foundation (grant number 993/17), Tel Aviv University Center for AI and Data Science (TAD), and the Yandex Initiative for Machine Learning at Tel Aviv University.

Shay Moran Departments of Mathematics and Computer Science, Technion and Google Research. [email protected]. Shay Moran is a Robert J. Shillman Fellow; he acknowledges support by ISF grant 1225/20, by BSF grant 2018385, by an Azrieli Faculty Fellowship, by Israel PBC-VATAT, by the Technion Center for Machine Learning and Intelligent Systems (MLIS), and by the European Union (ERC, GENERALIZATION, 101039692). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Kobbi Nissim Department of Computer Science, Georgetown University. [email protected]. Work partially funded by NSF grant No. 2001041 and by a gift to Georgetown University.

Uri Stemmer Tel Aviv University and Google research. [email protected]. Partially supported by the Israel Science Foundation (grant 1871/19) and by Len Blavatnik and the Blavatnik Family foundation.

(February 27, 2023)


1 Introduction

In this work we introduce a new variant of differential privacy (DP) (Dwork et al., 2006), suitable for interactive processes, and design new online learning algorithms that satisfy our definition. As a motivating story, consider a chatbot that continuously improves itself by learning from the conversations it conducts with users. As these conversations might contain sensitive information, we would like to provide privacy guarantees to the users, in the sense that the content of their conversations with the chatbot would not leak. This setting fleshes out the following two requirements.

(1) Clearly, the answers given by the chatbot to user $u_{i}$ must depend on the queries made by user $u_{i}$. For example, the chatbot should provide different answers when asked by user $u_{i}$ for the weather forecast in Antarctica, and when asked by $u_{i}$ for a pasta recipe.

This is in contrast to the plain formulation of differential privacy, where it is required that all of the mechanism outputs would be (almost) independent of any single user input. Therefore, the privacy requirement we are aiming for is that the conversation of user $u_{i}$ will remain “hidden” from other users, and would not leak through the other users’ interactions with the chatbot. Moreover, this should remain true even if a “privacy attacker” (aiming to extract information about the conversation user $u_{i}$ had) conducts many different conversations with the chatbot.

(2) The interaction with the chatbot is, by design, interactive and adaptive, as it aims to conduct dialogues with the users. This allows the privacy attacker (mentioned above) to choose its queries to the chatbot adaptively. Privacy, hence, needs to be preserved even in the presence of adaptive attackers.

While each of these two requirements was studied in isolation, to the best of our knowledge, they have not been unified into a combined privacy framework. Requirement (1) was formalized by Kearns et al. (2015) as joint differential privacy (JDP). It provides privacy against non-adaptive attackers. Intuitively, in the chatbot example, JDP aims to hide the conversation of user $u_{i}$ from any privacy attacker that chooses in advance all the queries it poses to the chatbot. This is unsatisfactory since the adaptive nature of this process invites adaptive attackers.

Requirement (2) was studied in many different settings, but to the best of our knowledge, only w.r.t. the plain formulation of DP, where the (adaptive) privacy attacker sees all of the outputs of the mechanism. Works in this vein include (Dwork et al., 2009; Chan et al., 2010; Hardt and Rothblum, 2010; Dwork et al., 2010b; Bun et al., 2017; Kaplan et al., 2021; Jain et al., 2021). In the chatbot example, plain DP would require, in particular, that even the messages sent from the chatbot to user $u_{i}$ would reveal (almost) no information about $u_{i}$. In theory, this could be obtained by making sure that the entire chatbot model is computed in a privacy preserving manner, such that even its full description leaks almost no information about any single user. Then, when user $u_{i}$ comes, we can “simply” share the model with her, and let her query it locally on her device. But this is likely unrealistic with large models involving hundreds of billions of parameters.

In this work we introduce challenge differential privacy, which can be viewed as an interactive variant of JDP, aimed at maintaining privacy against adaptive privacy attackers. Intuitively, in the chatbot example, our definition would guarantee that even an adaptive attacker that controls all of the users except for user $u_{i}$ , learns (almost) no information about the conversation user $u_{i}$ had with the chatbot. We give the formal definition of challenge-DP in Section 3, after surveying the existing variants of differential privacy in Section 2. In addition, we show that challenge-DP is closed under post-processing, composition, and group-privacy (where the first two properties are immediate, and the third is more subtle).

1.1 Private Online Classification

We initiate the study of challenge differential privacy in the basic setting of online classification. Let $\mathcal{X}$ be the domain, $\mathcal{Y}$ be the label space, and $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ be the set of labeled examples. An online learner is a (possibly randomized) mapping $\mathcal{A}:\mathcal{Z}^{\star}\times\mathcal{X}\to\mathcal{Y}$. That is, it is a mapping that maps a finite sequence $S\in\mathcal{Z}^{\star}$ (the past examples), and an unlabeled example $x$ (the current query point) to a label $y$, which is denoted by $y=\mathcal{A}(x;S)$.

Let $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ be a hypothesis class. A sequence $S\in\mathcal{Z}^{\star}$ is said to be realizable by $\mathcal{H}$ if there exists $h\in\mathcal{H}$ such that $h(x_{i})=y_{i}$ for every $(x_{i},y_{i})\in S$ . For a sequence $S=\{(x_{t},y_{t})\}_{t=1}^{T}\in\mathcal{Z}^{\star}$ we write $\mathcal{M}(\mathcal{A};S)$ for the random variable denoting the number of mistakes $\mathcal{A}$ makes during the execution on $S$ . That is

$$\mathcal{M}\bigl(\mathcal{A};S\bigr)=\sum_{t=1}^{T}1\{\hat{y}_{t}\neq y_{t}\},$$

where $\hat{y}_{t}=\mathcal{A}(x_{t};S_{<t})$ is the (randomized) prediction of $\mathcal{A}$ on $x_{t}$ .
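To make the mistake count concrete, the following is a minimal Python sketch of the online protocol. The names `LastLabelLearner` and `count_mistakes` are hypothetical and introduced only for illustration; the toy learner is an assumption and does not appear in the paper.

```python
class LastLabelLearner:
    """Toy learner (illustrative assumption): predicts the most recent true label."""

    def __init__(self):
        self.last_label = 0  # default prediction before any label is seen

    def predict(self, x):
        return self.last_label

    def update(self, x, y):
        self.last_label = y


def count_mistakes(learner, sequence):
    """Return the number of rounds t in which the prediction differs from y_t."""
    mistakes = 0
    for x, y in sequence:
        y_hat = learner.predict(x)   # the prediction depends only on past examples
        mistakes += int(y_hat != y)
        learner.update(x, y)         # the true label is revealed afterwards
    return mistakes


S = [("a", 1), ("b", 1), ("c", 0)]
print(count_mistakes(LastLabelLearner(), S))  # → 2
```

Here the learner is deterministic, so the count is a fixed number; for a randomized learner the same loop produces one sample of the random variable $\mathcal{M}(\mathcal{A};S)$.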

Definition 1.1 (Online Learnability: Realizable Case).

We say that a hypothesis class $\mathcal{H}$ is online learnable if there exists a learning rule $\mathcal{A}$ such that $\operatorname*{\mathbb{E}}\left[\mathcal{M}\bigl{(}\mathcal{A};S\bigr{)}\right]=o(T)$ for every sequence $S$ which is realizable by $\mathcal{H}$ .

Remark 1.2.

Notice that Definition 1.1 corresponds to an oblivious adversary, as it quantifies over the input sequence in advance. This should not be confused with the adversaries considered in the context of privacy which are always adaptive in this work. In the non-private setting, focusing on oblivious adversaries does not affect generality in terms of utility. This is less clear when privacy constraints are involved. (Footnote 1: In particular, Golowich and Livni (2021) studied both oblivious and adaptive adversaries, and obtained very different results in these two cases.) We emphasize that our results (our mistake bounds) continue to hold even when the realizable sequence is chosen by an adaptive (stateful) adversary, that at every point in time chooses the next input to the algorithm based on all of the previous outputs of the algorithm.

A classical result due to Littlestone (1988) characterizes online learnability (without privacy constraints) in terms of the Littlestone dimension. The latter is a combinatorial parameter of $\mathcal{H}$ which was named after Littlestone by Ben-David et al. (2009).

In particular, Littlestone’s characterization implies the following dichotomy: if $\mathcal{H}$ has finite Littlestone dimension $d$ then there exists a (deterministic) learning rule which makes at most $d$ mistakes on every realizable input sequence. In the complementing case, when the Littlestone dimension of $\mathcal{H}$ is infinite, for every learning rule $\mathcal{A}$ and every $T\in\mathbb{N}$ there exists a realizable sequence $S$ of length $T$ such that $\operatorname*{\mathbb{E}}\left[\mathcal{M}\bigl{(}\mathcal{A};S\bigr{)}\right]\geq T/2$ . In other words, as a function of $T$ , the optimal mistake bound is either uniformly bounded by the Littlestone dimension, or it is $\geq T/2$ . Because of this dichotomy, in some places online learnability is defined with respect to a uniform bound on the number of mistakes (and not just a sublinear one as in the above definition). In this work we follow the more general definition.
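For intuition, the Littlestone dimension of a very small finite class can be computed by brute force from its standard recursive characterization: a point $x$ contributes $1+\min$ of the dimensions of the two subclasses it induces, provided both are nonempty. The sketch below is only an illustration under that assumption; the function name `littlestone_dimension` and the encoding of hypotheses as label tuples are choices made here, and the running time is exponential.

```python
from functools import lru_cache


def littlestone_dimension(hypotheses):
    """Brute-force Ldim of a finite class; each hypothesis is a tuple of 0/1 labels."""
    n_points = len(next(iter(hypotheses)))

    @lru_cache(maxsize=None)
    def ldim(cls):
        best = 0  # a nonempty class always shatters the depth-0 tree
        for x in range(n_points):
            zeros = frozenset(h for h in cls if h[x] == 0)
            ones = cls - zeros
            if zeros and ones:  # x splits the class, so a tree can branch here
                best = max(best, 1 + min(ldim(zeros), ldim(ones)))
        return best

    return ldim(frozenset(hypotheses))


# Thresholds h_c(x) = 1[x >= c] on the domain {0, 1, 2}, for c in {0, 1, 2, 3}.
thresholds = [tuple(int(x >= c) for x in range(3)) for c in range(4)]
print(littlestone_dimension(thresholds))  # → 2
```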

We investigate the following questions:

*Can every online learnable class be learned by an algorithm which satisfies challenge differential privacy? What is the optimal mistake bound attainable by private learners?*

Our main result in this part provides an affirmative answer to the first question. We show that for any class $\mathcal{H}$ with Littlestone dimension $d$ there exists an $(\varepsilon,\delta)$ -challenge-DP learning rule which makes at most

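$$\tilde{O}\left(\frac{d^{2}}{\varepsilon^{2}}\log^{2}\left(\frac{1}{\delta}\right)\log^{2}\left(\frac{T}{\beta}\right)\right)$$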

mistakes, with probability $1-\beta$, on every realizable sequence of length $T$. Remarkably, our proof provides an efficient transformation taking a non-private learner to a private one: that is, given black-box access to a learning rule $\mathcal{A}$ which makes at most $M$ mistakes in the realizable case, we efficiently construct an $(\varepsilon,\delta)$-challenge-DP learning rule $\mathcal{A}^{\prime}$ which makes at most $\tilde{O}\left(\frac{M^{2}}{\varepsilon^{2}}\log^{2}\left(\frac{1}{\delta}\right)\log^{2}\left(\frac{T}{\beta}\right)\right)$ mistakes.

1.1.1 Construction overview

We now give a simplified overview of our construction, called POP, which transforms a non-private online learning algorithm into a private one (while maintaining computational efficiency). Let $\mathcal{A}$ be a non-private algorithm, guaranteed to make at most $d$ mistakes in the realizable setting. We maintain $k$ copies of $\mathcal{A}$ . Informally, in every round $i\in[T]$ we do the following:

1. Obtain an input point $x_{i}$.

2. Give $x_{i}$ to each of the $k$ copies of $\mathcal{A}$ to obtain predicted labels $\hat{y}_{i,1},\dots,\hat{y}_{i,k}$.

3. Output a “privacy preserving” aggregation $\hat{y}_{i}$ of $\left\{\hat{y}_{i,1},\dots,\hat{y}_{i,k}\right\}$, which is some variant of noisy majority. This step will only satisfy our notion of challenge-DP.

4. Obtain the “true” label $y_{i}$.

5. Let $\ell\in[k]$ be chosen at random.

6. Rewind all of the copies of algorithm $\mathcal{A}$ except for the $\ell$th copy, so that they “forget” ever seeing $x_{i}$.

7. Give the true label $y_{i}$ to the $\ell$th copy of $\mathcal{A}$.

As we aggregate the predictions given by the copies of $\mathcal{A}$ using (noisy) majority, we know that if the algorithm errs then at least a constant fraction of the copies of $\mathcal{A}$ err. As we feed the true label $y_{i}$ to a random copy, with constant probability, the copy which we do not rewind incurs a mistake at this moment. That is, whenever we make a mistake then with constant probability one of the copies we maintain incurs a mistake. This can happen at most $\approx k\cdot d$ times, since we have $k$ copies and each of them makes at most $d$ mistakes. This allows us to bound the number of mistakes made by our algorithm (w.h.p.). The privacy analysis is more involved. Intuitively, by rewinding all of the copies of $\mathcal{A}$ (except one) in every round, we make sure that a single user can affect the inner state of at most one of the copies. This allows us to efficiently aggregate the predictions given by the copies in a privacy preserving manner. The subtle point is that the prediction we release at time $i$ does require querying all the experts on the current example $x_{i}$ (before rewinding them). Nevertheless, we show that this algorithm is private.
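The control flow of the informal round structure described above can be sketched in Python as follows. This is only an illustration of the rewind-and-feed-one-copy loop, under several assumptions that are not from the paper: the base learner is a toy halving learner over threshold functions (`HalvingLearner`), and the aggregation is a plain majority vote with Laplace noise, used as a stand-in for the paper's actual aggregation step. The sketch makes no privacy claim, and all names in it are hypothetical.

```python
import copy
import random


class HalvingLearner:
    """Toy non-private learner for thresholds h_c(x) = 1[x >= c] (an assumption)."""

    def __init__(self, thresholds):
        self.version_space = list(thresholds)

    def predict(self, x):
        ones = sum(1 for c in self.version_space if x >= c)
        return 1 if 2 * ones >= len(self.version_space) else 0

    def update(self, x, y):
        consistent = [c for c in self.version_space if int(x >= c) == y]
        if consistent:  # keep the state usable even on non-realizable input
            self.version_space = consistent


def laplace(scale, rng):
    """Sample Lap(scale) as a difference of two exponentials; 0 if scale == 0."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def pop_sketch(make_learner, stream, k, noise_scale, rng):
    """Run the rewind-based loop on a labeled stream; return the mistake count."""
    copies = [make_learner() for _ in range(k)]
    mistakes = 0
    for x, y in stream:
        snapshots = [copy.deepcopy(c) for c in copies]  # states before seeing x
        votes = [c.predict(x) for c in copies]          # query every copy on x
        noisy_ones = sum(votes) + laplace(noise_scale, rng)
        y_hat = 1 if noisy_ones >= k / 2 else 0         # noisy majority (stand-in)
        mistakes += int(y_hat != y)
        ell = rng.randrange(k)                          # copy that keeps the example
        for j in range(k):
            if j != ell:
                copies[j] = snapshots[j]                # rewind: forget x
        copies[ell].update(x, y)                        # feed the true label
    return mistakes


rng = random.Random(0)
stream = [(x, int(x >= 11)) for x in (rng.randrange(17) for _ in range(200))]
print(pop_sketch(lambda: HalvingLearner(range(17)), stream, k=5,
                 noise_scale=1.0, rng=rng))
```

With `k=1` and `noise_scale=0` the loop reduces to running the base learner on its own, which is a convenient sanity check for the bookkeeping.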

1.1.2 Comparison with Golowich and Livni (2021)

The closest prior work to this manuscript is by Golowich and Livni who also studied the problem of private online classification, but under a more restrictive notion of privacy than challenge-DP. In particular their definition requires that the sequence of predictors which the learner uses to predict in each round does not compromise privacy. In other words, it is as if at each round the learner publishes the entire truth-table of its predictor, rather than just its current prediction. This might be too prohibitive in certain applications such as the chatbot example illustrated above. Golowich and Livni show that even with respect to their more restrictive notion of privacy it is possible to online learn every Littlestone class. However, their mistake bound is doubly exponential in the Littlestone dimension (whereas ours is quadratic), and their construction requires more elaborate access to the non-private learner. In particular, it is not clear whether their construction can be implemented efficiently.

1.2 Additional Related Work

Several works studied the related problem of private learning from expert advice (Dwork et al., 2010a; Jain et al., 2012; Thakurta and Smith, 2013; Dwork and Roth, 2014; Jain and Thakurta, 2014; Agarwal and Singh, 2017; Asi et al., 2022). These works study a variant of the experts problem in which the learning algorithm has access to $k$ experts; on every time step the learning algorithm chooses one of the experts to follow, and then observes the loss of each expert. The goal of the learning algorithm is that its accumulated loss will be competitive with the loss of the best expert in hindsight. In this setting the private data is the sequence of losses observed throughout the execution, and the privacy requirement is that the sequence of experts chosen by the algorithm should not compromise the privacy of the sequence of losses. (Footnote 2: Asi et al. (2022) study a more general framework of adaptive privacy in which the private data is an auxiliary sequence $(z_{1},\ldots,z_{T})$. During the interaction with the learner, these $z_{t}$’s are used (possibly in an adaptive way) to choose the sequence of loss functions.) When applying these results to our context, the set of experts is the set of hypotheses in the class $\mathcal{H}$, which means that the outcome of the learner (on every time step) is a complete model (i.e., a hypothesis). That is, in our context, applying prior works on private prediction from expert advice would result in a privacy definition similar to that of Golowich and Livni (2021) that accounts (in the privacy analysis) for releasing complete models, rather than just the predictions, which is significantly more restrictive.

There were a few works that studied private learning in online settings under the constraint of JDP. For example, Shariff and Sheffet (2018) studied the stochastic contextual linear bandits problem under JDP. Here, in every round $t$ the learner receives a context $c_{t}$, then it selects an action $a_{t}$ (from a fixed set of actions), and finally it receives a reward $y_{t}$ which depends on $(c_{t},a_{t})$ in a linear way. The learner’s objective is to maximize cumulative reward. The (non-adaptive) definition of JDP means that action $a_{t}$ is revealed only to user $u_{t}$. Furthermore, it guarantees that the inputs of user $u_{t}$ (specifically the context $c_{t}$ and the reward $y_{t}$) do not leak to the other users via the actions they are given, provided that all these other users fix their data in advance. This non-adaptive privacy notion fits the stochastic setting of Shariff and Sheffet (2018), but (we believe) is less suited for adversarial processes like the ones we consider in this work. We also note that the algorithm of Shariff and Sheffet (2018) in fact satisfies the more restrictive privacy definition which applies to the sequence of predictors (rather than the sequence of predictions), similarly to that of Golowich and Livni (2021).

A parallel (unpublished) work by Nissim et al. studied a related setting, which can be viewed as an “evolving” variant of the private PAC learning model. They also use an adaptive variant of JDP, similar to our notion of privacy, which is tailored to their stochastic setting.

2 Preliminaries

Notation.

Two datasets $S$ and $S^{\prime}$ are called neighboring if one is obtained from the other by adding or deleting one element, e.g., $S^{\prime}=S\cup\{x^{\prime}\}$. For two random variables $X,Y$ we write $X\approx_{(\varepsilon,\delta)}Y$ to mean that for every event $F$ it holds that $\Pr[X\in F]\leq e^{\varepsilon}\cdot\Pr[Y\in F]+\delta$, and $\Pr[Y\in F]\leq e^{\varepsilon}\cdot\Pr[X\in F]+\delta$. Throughout the paper we assume that the privacy parameter $\varepsilon$ satisfies $\varepsilon=O(1)$, but our analyses trivially extend to larger values of $\varepsilon$.
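For distributions over a finite set of outcomes, the relation $\approx_{(\varepsilon,\delta)}$ can be checked directly: the event $F$ that maximizes $\Pr[X\in F]-e^{\varepsilon}\Pr[Y\in F]$ consists of exactly the outcomes where the first probability exceeds $e^{\varepsilon}$ times the second. The Python sketch below uses this fact; the function name `approx_indistinguishable`, the dictionary encoding, and the numerical tolerance `tol` are assumptions made for illustration.

```python
import math


def approx_indistinguishable(p, q, eps, delta, tol=1e-12):
    """Check X ≈_(eps, delta) Y for finite distributions p, q given as dicts."""
    support = set(p) | set(q)

    def excess(a, b):
        # The worst-case event F collects outcomes where a exceeds e^eps * b.
        return sum(max(0.0, a.get(o, 0.0) - math.exp(eps) * b.get(o, 0.0))
                   for o in support)

    return excess(p, q) <= delta + tol and excess(q, p) <= delta + tol


# Randomized response on one bit: report the true bit with probability 3/4.
p = {0: 0.75, 1: 0.25}  # output distribution when the input bit is 0
q = {0: 0.25, 1: 0.75}  # output distribution when the input bit is 1
print(approx_indistinguishable(p, q, math.log(3), 0.0))   # → True
print(approx_indistinguishable(p, q, math.log(2), 0.0))   # → False
print(approx_indistinguishable(p, q, math.log(2), 0.25))  # → True
```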

The standard definition of differential privacy is as follows.

Definition 2.1 ((Dwork et al., 2006)).

Let $\mathcal{M}$ be a randomized algorithm that operates on datasets. Algorithm $\mathcal{M}$ is $(\varepsilon,\delta)$ -differentially private (DP) if for any two neighboring datasets $S,S^{\prime}$ we have $\mathcal{M}(S)\approx_{(\varepsilon,\delta)}\mathcal{M}(S^{\prime})$ .

The Laplace mechanism.

The most basic constructions of differentially private algorithms are via the Laplace mechanism as follows.

Definition 2.2.

A random variable has probability distribution $\mathop{\rm{Lap}}\nolimits(\gamma)$ if its probability density function is $f(x)=\frac{1}{2\gamma}\exp(-|x|/\gamma)$, where $x\in\mathbb{R}$.

Definition 2.3 (Sensitivity).

A function $f$ that maps datasets to the reals has sensitivity $\Delta$ if for every two neighboring datasets $S$ and $S^{\prime}$ it holds that $|f(S)-f(S^{\prime})|\leq\Delta$ .

Theorem 2.4 (The Laplace Mechanism (Dwork et al., 2006)).

Let $f$ be a function that maps datasets to the reals with sensitivity $\Delta$ . The mechanism $\mathcal{A}$ that on input $S$ adds noise with distribution $\mathop{\rm{Lap}}\nolimits(\frac{\Delta}{\varepsilon})$ to the output of $f(S)$ preserves $(\varepsilon,0)$ -differential privacy.
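As an illustration of Theorem 2.4, here is a minimal Python sketch of the Laplace mechanism applied to a counting query. The helper names and the example data are assumptions introduced here; the noise is sampled as the difference of two exponential variables, which has the $\mathop{\rm{Lap}}$ distribution of Definition 2.2. This is a sketch for exposition and does not address the floating-point subtleties a careful implementation would need to consider.

```python
import random


def laplace_noise(scale, rng):
    """Sample Lap(scale): the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def laplace_mechanism(f, dataset, sensitivity, eps, rng):
    """Release f(dataset) + Lap(sensitivity / eps)."""
    return f(dataset) + laplace_noise(sensitivity / eps, rng)


# A counting query has sensitivity 1 under add/remove-one neighboring datasets.
ages = [23, 35, 41, 58, 62]
count_over_40 = lambda data: sum(1 for a in data if a > 40)
rng = random.Random(0)
print(laplace_mechanism(count_over_40, ages, sensitivity=1, eps=0.5, rng=rng))
```

Smaller values of `eps` give stronger privacy and larger noise: the expected absolute error of the released value is `sensitivity / eps`.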

Joint differential privacy.

The standard definition of differential privacy (Definition 2.1) captures a setting in which the entire output of the computation may be publicly released without compromising privacy. While this is a very desirable requirement, it is sometimes too restrictive. Indeed, Kearns et al. (2015) considered a relaxed setting in which we aim to analyze a dataset $S=(x_{1},\dots,x_{n})$, where every $x_{i}$ represents the information of user $i$, and to obtain a vector of outcomes $(y_{1},\dots,y_{n})$. This vector, however, is not made public. Instead, every user $i$ only receives its “corresponding outcome” $y_{i}$. This setting potentially allows the outcome $y_{i}$ to strongly depend on the input $x_{i}$, without compromising the privacy of the $i$th user from the viewpoint of the other users.

Definition 2.5 ((Kearns et al., 2015)).

Let $\mathcal{M}:X^{n}\rightarrow Y^{n}$ be a randomized algorithm that takes a dataset $S\in X^{n}$ and outputs a vector $\vec{y}\in Y^{n}$ . Algorithm $\mathcal{M}$ satisfies $(\varepsilon,\delta)$ -joint differential privacy (JDP) if for every $i\in[n]$ and every two datasets $S,S^{\prime}\in X^{n}$ differing only on their $i$ th point it holds that $\mathcal{M}(S)_{-i}\approx_{(\varepsilon,\delta)}\mathcal{M}(S^{\prime})_{-i}$ . Here $\mathcal{M}(S)_{-i}$ denotes the (random) vector of length $n-1$ obtained by running $(y_{1},\dots,y_{n})\leftarrow\mathcal{M}(S)$ and returning $(y_{1},\dots,y_{i-1},y_{i+1},\dots,y_{n})$ .

In words, consider an algorithm $\mathcal{M}$ that operates on the data of $n$ individuals and outputs $n$ outcomes $y_{1},\dots,y_{n}$. This algorithm is JDP if changing only the $i$th input point $x_{i}$ has almost no effect on the outcome distribution of the other outputs (but the outcome distribution of $y_{i}$ is allowed to strongly depend on $x_{i}$). Kearns et al. (2015) showed that this setting fits a wide range of problems in economic environments.

Example 2.6 ((Nahmias et al., 2019)).

Suppose that a city water corporation is interested in promoting water conservation. To do so, the corporation decided to send each household a customized report indicating whether their water consumption is above or below the median consumption in the neighborhood. Of course, this must be done in a way that protects the privacy of the neighbors. One way to tackle this would be to compute a privacy preserving estimation $z$ for the median consumption (satisfying Definition 2.1). Then, in each report, we could safely indicate whether the household’s water consumption is bigger or smaller than $z$ . While this solution is natural and intuitive, it turns out to be sub-optimal: We can obtain better utility by designing a JDP algorithm that directly computes a different outcome for each user (“above” or “below”), which is what we really aimed for, without going through a private median computation.

Algorithm AboveThreshold.

Consider a large number of low sensitivity functions $f_{1},f_{2},\dots,f_{T}$ which are given (one by one) to a data curator (holding a dataset $S$ ). Algorithm AboveThreshold allows for privately identifying the queries $f_{i}$ whose value $f_{i}(S)$ is (roughly) greater than some threshold $t$ .
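A minimal Python sketch of this idea is given below. It follows the textbook variant of AboveThreshold from Dwork and Roth (2014) for sensitivity-1 queries, which halts after the first query reported as above the threshold; the constants and details may differ from the version analyzed in this paper, and the function names are assumptions made here.

```python
import random


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def above_threshold(dataset, queries, threshold, eps, rng):
    """Yield False for each query answered "below"; yield True once and halt.

    Assumes every query has sensitivity 1.
    """
    noisy_threshold = threshold + laplace_noise(2 / eps, rng)
    for f in queries:
        if f(dataset) + laplace_noise(4 / eps, rng) >= noisy_threshold:
            yield True   # first (noisy) above-threshold query: report and stop
            return
        yield False


data = [1, 4, 4, 7, 9, 9, 9]
# Counting queries "how many records equal v", for v = 1, 4, 9, 7.
queries = [lambda d, v=v: sum(1 for x in d if x == v) for v in (1, 4, 9, 7)]
rng = random.Random(0)
print(list(above_threshold(data, queries, threshold=2.5, eps=1.0, rng=rng)))
```

The threshold is perturbed once, while every query receives fresh noise; in this variant the privacy cost does not grow with the number of "below" answers.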

Even though the number of possible rounds is unbounded, algorithm AboveThreshold preserves differential privacy. Note, however, that AboveThreshold is an interactive mechanism, while the standard definition of differential privacy (Definition 2.1) is stated for non-interactive mechanisms, that process their input dataset, release an output, and halt. The adaptation of DP to such interactive settings is done via a game between the (interactive) mechanism and an adversary that specifies the inputs to the mechanism and observes its outputs. Intuitively, the privacy requirement is that the view of the adversary at the end of the execution should be differentially private w.r.t. the inputs given to the mechanism. Formally,

Definition 2.7 (DP under adaptive queries (Dwork et al., 2006; Bun et al., 2017)).

Let $\mathcal{M}$ be a mechanism that takes an input dataset and answers a sequence of adaptively chosen queries (specified by an adversary $\mathcal{B}$ and chosen from some family $Q$ of possible queries). Mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$ -differentially private if for every adversary $\mathcal{B}$ we have that $\texttt{AdaptiveQuery}_{\mathcal{M},\mathcal{B},Q}$ (defined below) is $(\varepsilon,\delta)$ -differentially private (w.r.t. its input bit $b$ ).

Theorem 2.8 ((Dwork et al., 2009; Hardt and Rothblum, 2010; Kaplan et al., 2021)).

Algorithm AboveThreshold is $(\varepsilon,\delta)$ -differentially private.

A private counter.

In the setting of algorithm AboveThreshold, the dataset is fixed in the beginning of the execution, and the queries arrive sequentially one by one. Dwork et al. (2010a) and Chan et al. (2010) considered a different setting, in which the data arrives sequentially. In particular, they considered the counter problem where in every time step $i\in[T]$ we obtain an input bit $x_{i}\in\{0,1\}$ (representing the data of user $i$ ) and must immediately respond with an approximation for the current sum of the bits. That is, at time $i$ we wish to release an approximation for $x_{1}+x_{2}+\dots+x_{i}$ .

Similarly to our previous discussion, this is an interactive setting, and privacy is defined via a game between a mechanism $\mathcal{M}$ and an adversary $\mathcal{B}$ that adaptively determines the inputs for the mechanism.

Definition 2.9 (DP under adaptive inputs (Dwork et al., 2006, 2010a; Chan et al., 2010; Kaplan et al., 2021; Jain et al., 2021)).

Let $\mathcal{M}$ be a mechanism that in every round $i$ obtains an input point $x_{i}$ (representing the information of user $i$ ) and outputs a response $a_{i}$ . Mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$ -differentially private if for every adversary $\mathcal{B}$ we have that $\texttt{AdaptiveInput}_{\mathcal{M},\mathcal{B}}$ (defined below) is $(\varepsilon,\delta)$ -differentially private (w.r.t. its input bit $b$ ).

Theorem 2.10 (Private counter (Dwork et al., 2010a; Chan et al., 2010; Jain et al., 2021)).

There exists a mechanism $\mathcal{M}$ that in each round $i\in[T]$ obtains an input bit $x_{i}\in\{0,1\}$ and outputs a response $a_{i}\in\N$ with the following properties:

1. $\mathcal{M}$ is $(\varepsilon,0)$-differentially private (as in Definition 2.9).

2. Let $s$ denote the random coins of $\mathcal{M}$. Then there exists an event $E$ such that: (1) $\Pr[s\in E]\geq 1-\beta$, and (2) conditioned on every $s\in E$, for every input sequence $(x_{1},\dots,x_{T})$, the answers $(a_{1},\dots,a_{T})$ satisfy

$$\left|a_{i}-\sum_{j=1}^{i}x_{j}\right|\leq O\left(\frac{1}{\varepsilon}\log(T)\log\left(\frac{T}{\beta}\right)\right)\quad\text{for every } i\in[T].$$
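One standard way to obtain a counter of this kind is the binary-tree mechanism of Dwork et al. (2010a) and Chan et al. (2010). The sketch below is a minimal illustration under that assumption. The class name `BinaryCounter` and its interface are hypothetical. It returns real-valued answers, which could be rounded to naturals as post-processing.

```python
import random

def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

class BinaryCounter:
    """Continual counter over at most T bits using noisy dyadic partial sums."""

    def __init__(self, T, eps, seed=None):
        self.levels = max(1, T.bit_length())   # each bit touches at most this many partial sums
        self.scale = self.levels / eps          # Laplace scale per partial sum
        self.exact = [0] * self.levels          # exact partial sums, one per level
        self.noisy = [0.0] * self.levels        # their noisy versions
        self.t = 0
        self.rng = random.Random(seed)

    def step(self, x):
        """Consume bit x and return a noisy estimate of x_1 + ... + x_t."""
        self.t += 1
        t = self.t
        i = (t & -t).bit_length() - 1           # index of the lowest set bit of t
        # Merge all lower levels plus the new bit into level i.
        self.exact[i] = sum(self.exact[:i]) + x
        for j in range(i):
            self.exact[j] = 0
            self.noisy[j] = 0.0
        self.noisy[i] = self.exact[i] + laplace(self.scale, self.rng)
        # The prefix sum is the sum of the levels named by the binary expansion of t.
        return sum(self.noisy[j] for j in range(self.levels) if (t >> j) & 1)
```

Each answer adds up at most `levels` noisy partial sums, which is where the polylogarithmic dependence on $T$ in the error comes from.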

3 Challenge Differential Privacy

We now introduce the privacy definition we consider in this work. Intuitively, the requirement is that even an adaptive adversary controlling all of the users except Alice cannot learn much information about the interaction Alice had with the algorithm.

Definition 3.1.

Consider an algorithm $\mathcal{M}$ that, in each round $i\in[T]$ obtains an input point $x_{i}$ , outputs a “predicted” label $\hat{y}_{i}$ , and obtains a “true” label $y_{i}$ . We say that algorithm $\mathcal{M}$ is $(\varepsilon,\delta)$ -challenge differentially private if for any adversary $\mathcal{B}$ we have that $\texttt{OnlineGame}_{\mathcal{M},\mathcal{B},T}$ , defined below, is $(\varepsilon,\delta)$ -differentially private (w.r.t. its input bit $b$ ).

Remark 3.2.

For readability, we have simplified Definition 3.1 and tailored it to the setting of online learning. Our algorithms satisfy a stronger variant of the definition, in which the adversary may adaptively choose the “true” labels $y_{i}$ also based on the “predicted” labels $\hat{y}_{i}$ . See Appendix A for the generalized definition.

Composition and post-processing.

Composition and post-processing for challenge-DP follow immediately from their analogues for (standard) DP. Formally, composition is defined via the following game, called CompositionGame, in which a “meta adversary” $\mathcal{B}^{*}$ is trying to guess an unknown bit $b\in\{0,1\}$. The meta adversary $\mathcal{B}^{*}$ is allowed to (adaptively) invoke $m$ executions of the game specified in Algorithm 4, where all of these $m$ executions are done with the same (unknown) bit $b$. See Algorithm 5. The following theorem follows immediately from standard composition theorems for differential privacy (Dwork et al., 2010b).

Theorem 3.3 (special case of (Dwork et al., 2010b)).

For every $\mathcal{B}^{*}$ , every $m\in\N$ and every $\varepsilon,\delta,\delta^{\prime}\geq 0$ it holds that $\texttt{CompositionGame}_{\mathcal{B}^{*},m,\varepsilon,\delta}$ is $(\varepsilon^{\prime},m\delta+\delta^{\prime})$ -differentially private (w.r.t. the input bit $b$ ) for

[TABLE]
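Assuming the expression for $\varepsilon^{\prime}$ is the standard advanced-composition bound of Dwork et al. (2010b), namely $\varepsilon^{\prime}=\sqrt{2m\ln(1/\delta^{\prime})}\,\varepsilon+m\varepsilon(e^{\varepsilon}-1)$, the following sketch evaluates the composed parameters numerically. The function name `advanced_composition` is illustrative.

```python
import math

def advanced_composition(eps, delta, m, delta_prime):
    """Return (eps_total, delta_total) for m adaptive (eps, delta)-DP executions.

    Assumes the standard advanced-composition bound; delta_prime must be in (0, 1).
    """
    eps_total = math.sqrt(2 * m * math.log(1 / delta_prime)) * eps + m * eps * math.expm1(eps)
    delta_total = m * delta + delta_prime
    return eps_total, delta_total
```

For example, with `eps=0.1` and `m=100` the composed $\varepsilon^{\prime}$ is noticeably smaller than the basic-composition value $m\varepsilon=10$, at the cost of the extra additive $\delta^{\prime}$.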

Group privacy.

We show that challenge-DP is closed under group privacy. This is more subtle than the composition argument. In fact, we first need to define what we mean by “group privacy” in the context of challenge-DP. This is done using the parameter $g$ in algorithm OnlineGame.

Theorem 3.4.

Let $\mathcal{M}$ be an algorithm that in each round $i\in[T]$ obtains an input point $x_{i}$ , outputs a “predicted” label $\hat{y}_{i}$ , and obtains a “true” label $y_{i}$ . If $\mathcal{M}$ is $(\varepsilon,\delta)$ -challenge-DP then for every $g\in\N$ and every adversary $\mathcal{B}$ (posing at most $g$ challenges) we have that $\texttt{OnlineGame}_{\mathcal{M},\mathcal{B},T,g}$ is $(g\varepsilon,g\cdot e^{\varepsilon g}\cdot\delta)$ -differentially private.

Proof.

Fix $g\in\N$ and fix an adversary $\mathcal{B}$ (that poses at most $g$ challenge rounds). We consider a sequence of games $\mathcal{W}_{0},\mathcal{W}_{1},\dots,\mathcal{W}_{g}$ , where $\mathcal{W}_{\ell}$ is defined as follows.

1. Initialize algorithm $\mathcal{M}$ and the adversary $\mathcal{B}$.

2. For round $i=1,2,\dots,T$:

   (a) Obtain a challenge indicator $c_{i}$ and two labeled inputs $(x_{i,0},y_{i,0})$ and $(x_{i,1},y_{i,1})$ from $\mathcal{B}$.

   (b) If $\sum_{j=1}^{i}c_{j}>\ell$ then set $(w_{i},z_{i})=(x_{i,0},y_{i,0})$. Otherwise set $(w_{i},z_{i})=(x_{i,1},y_{i,1})$.

   (c) Feed $w_{i}$ to algorithm $\mathcal{M}$, obtain an outcome $\hat{y}_{i}$, and feed it $z_{i}$.

   (d) If $c_{i}=0$ then set $\tilde{y}_{i}=\hat{y}_{i}$. Otherwise set $\tilde{y}_{i}=\bot$.

   (e) Give $\tilde{y}_{i}$ to $\mathcal{B}$.

3. Output $\tilde{y}_{1},\dots,\tilde{y}_{T}$ and the internal randomness of $\mathcal{B}$.

That is, $\mathcal{W}_{\ell}$ simulates the online game between $\mathcal{M}$ and $\mathcal{B}$ , where during the first $\ell$ challenge rounds algorithm $\mathcal{M}$ is given $(x_{i,1},y_{i,1})$ , and in the rest of the challenge rounds algorithm $\mathcal{M}$ is given $(x_{i,0},y_{i,0})$ . Note that

[TABLE]
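The hybrid game $\mathcal{W}_{\ell}$ can be sketched in Python as follows. The learner and adversary interfaces (`predict`, `update`, `next_round`, `receive`) and the two toy classes are hypothetical, introduced only for this illustration. `None` stands for $\bot$, and the adversary's internal randomness is omitted from the output.

```python
def hybrid_game(learner, adversary, T, ell):
    """Run the hybrid game W_ell and return the answers shown to the adversary."""
    challenges = 0
    transcript = []
    for _ in range(T):
        # The adversary supplies a challenge indicator and two labeled inputs.
        c, pair0, pair1 = adversary.next_round()
        challenges += c
        # Use index 1 while at most ell challenges have been posed, index 0 afterwards.
        w, z = pair0 if challenges > ell else pair1
        y_hat = learner.predict(w)
        learner.update(z)
        # Predictions in challenge rounds are hidden from the adversary.
        y_tilde = y_hat if c == 0 else None
        adversary.receive(y_tilde)
        transcript.append(y_tilde)
    return transcript

class RecordingLearner:
    """Toy learner that always predicts 0 and records the inputs it is fed."""
    def __init__(self):
        self.seen = []
    def predict(self, x):
        self.seen.append(x)
        return 0
    def update(self, y):
        pass

class ScriptedAdversary:
    """Toy adversary that poses challenges in a fixed set of rounds (1-indexed)."""
    def __init__(self, challenge_rounds):
        self.challenge_rounds = set(challenge_rounds)
        self.i = 0
    def next_round(self):
        self.i += 1
        if self.i in self.challenge_rounds:
            return 1, (("x0", self.i), 0), (("x1", self.i), 1)
        same = (("x", self.i), 0)
        return 0, same, same
    def receive(self, y_tilde):
        pass
```

Running `hybrid_game(RecordingLearner(), ScriptedAdversary([2, 4]), T=5, ell=1)` feeds the index-1 input in the first challenge round and the index-0 input in the second, and hides both predictions.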

We claim that for every $0<\ell\leq g$ it holds that $\mathcal{W}_{\ell-1}\approx_{(\varepsilon,\delta)}\mathcal{W}_{\ell}$. To this end, fix $0<\ell\leq g$ and consider an adversary $\widehat{\mathcal{B}}$, posing at most one challenge, defined as follows. Algorithm $\widehat{\mathcal{B}}$ runs $\mathcal{B}$ internally. In every round $i$, algorithm $\widehat{\mathcal{B}}$ obtains from $\mathcal{B}$ a challenge bit $c_{i}$ and two labeled inputs $(x_{i,0},y_{i,0})$ and $(x_{i,1},y_{i,1})$. As long as $\mathcal{B}$ has not posed its $\ell$th challenge, algorithm $\widehat{\mathcal{B}}$ outputs $(x_{i,1},y_{i,1}),(x_{i,1},y_{i,1})$. During the round $i$ in which $\mathcal{B}$ poses its $\ell$th challenge, algorithm $\widehat{\mathcal{B}}$ outputs $(x_{i,0},y_{i,0}),(x_{i,1},y_{i,1})$. This is the challenge round posed by algorithm $\widehat{\mathcal{B}}$. In every round $i$ afterwards, algorithm $\widehat{\mathcal{B}}$ outputs $(x_{i,0},y_{i,0}),(x_{i,0},y_{i,0})$. When algorithm $\widehat{\mathcal{B}}$ obtains an answer $\tilde{y}_{i}$ it sets $\tilde{\tilde{y}}_{i}=\begin{cases}\tilde{y}_{i}, & \text{if }c_{i}=0\\ \bot, & \text{if }c_{i}=1\end{cases}$ and gives $\tilde{\tilde{y}}_{i}$ to algorithm $\mathcal{B}$.

As $\widehat{\mathcal{B}}$ is an adversary that poses (at most) one challenge, by the privacy properties of $\mathcal{M}$ we know that $\texttt{OnlineGame}_{\mathcal{M},\widehat{\mathcal{B}},T}$ is $(\varepsilon,\delta)$-DP. Recall that the output of $\texttt{OnlineGame}_{\mathcal{M},\widehat{\mathcal{B}},T}$ includes all of the randomness of $\widehat{\mathcal{B}}$, as well as the answers $\tilde{y}_{i}$ generated throughout the game. This includes the randomness of $\mathcal{B}$ (which $\widehat{\mathcal{B}}$ runs internally), and hence also determines all of the $\tilde{\tilde{y}}_{i}$’s defined by $\widehat{\mathcal{B}}$ throughout the interaction. Let $P$ be a post-processing procedure that takes the output of $\texttt{OnlineGame}_{\mathcal{M},\widehat{\mathcal{B}},T}$ and returns the randomness of $\mathcal{B}$ as well as $(\tilde{\tilde{y}}_{1},\dots,\tilde{\tilde{y}}_{T})$. By closure of DP to post-processing, we have that

[TABLE]

Now note that

[TABLE]

and hence $\mathcal{W}_{\ell-1}\approx_{(\varepsilon,\delta)}\mathcal{W}_{\ell}$ . Overall we have that

[TABLE]

This shows that $\texttt{OnlineGame}_{\mathcal{M},\mathcal{B},T,g}$ is $(g\varepsilon,g\cdot e^{\varepsilon g}\cdot\delta)$-differentially private, thereby completing the proof. ∎
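As a worked illustration of how the $g$ hybrid steps accumulate, the following chain applies $\mathcal{W}_{\ell-1}\approx_{(\varepsilon,\delta)}\mathcal{W}_{\ell}$ once per step for an arbitrary event $F$, with $\mathcal{W}_{0}$ and $\mathcal{W}_{g}$ taken as the two ends of the hybrid sequence. This is a sketch of the standard argument rather than the paper's exact display. The reverse direction is symmetric.

```latex
\begin{align*}
\Pr[\mathcal{W}_{0}\in F]
 &\leq e^{\varepsilon}\Pr[\mathcal{W}_{1}\in F]+\delta \\
 &\leq e^{2\varepsilon}\Pr[\mathcal{W}_{2}\in F]+(1+e^{\varepsilon})\,\delta \\
 &\leq \dots \\
 &\leq e^{g\varepsilon}\Pr[\mathcal{W}_{g}\in F]+\Bigl(\sum_{j=0}^{g-1}e^{j\varepsilon}\Bigr)\delta \\
 &\leq e^{g\varepsilon}\Pr[\mathcal{W}_{g}\in F]+g\cdot e^{\varepsilon g}\cdot\delta .
\end{align*}
```

The last step bounds each of the $g$ terms $e^{j\varepsilon}$ by $e^{\varepsilon g}$.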

4 Online Classification under Challenge Differential Privacy

Towards presenting our private online learner, we introduce a variant of algorithm AboveThreshold with additional guarantees, which we call ChallengeAT. Recall that AboveThreshold “hides” arbitrary modifications to a single input point. Intuitively, the new variant we present aims to hide both an arbitrary modification to a single input point and an arbitrary modification to a single query throughout the execution. Consider algorithm ChallengeAT.

Remark 4.1.

When we apply ChallengeAT, it sets $\lambda=O\left(\frac{1}{\varepsilon}\log(T)\log\left(\frac{T}{\beta}\right)\right)$ . Technically, for this it has to know $T$ and $\beta$ . To simplify the description this is not explicit in our algorithms.

The utility guarantees of ChallengeAT are straightforward. The following theorem follows by bounding (w.h.p.) all the noises sampled throughout the execution (when instantiating ChallengeAT with the private counter from Theorem 2.10). (Footnote 3: The event $E$ occurs when all the Laplace noises of the counter and ChallengeAT are within a factor of $\log(T/\beta)$ of their expectation.)

Theorem 4.2.

Let $s$ denote the random coins of ChallengeAT. Then there exists an event $E$ such that: (1) $\Pr[s\in E]\geq 1-\beta$ , and (2) Conditioned on every $s\in E$ , for every input dataset $S$ and every sequence of $T$ queries $(f_{1},\dots,f_{T})$ it holds that

1. Algorithm ChallengeAT does not halt before the $r$th time in which it outputs $\sigma_{i}=1$.

2. For every $i$ such that $\sigma_{i}=1$ it holds that $f_{i}(S)\geq t-O\left(\frac{\Delta}{\varepsilon}\sqrt{r+\lambda}\ln(\frac{r+\lambda}{\delta})\log(\frac{T}{\beta})\right)$.

3. For every $i$ such that $\sigma_{i}=0$ it holds that $f_{i}(S)\leq t+O\left(\frac{\Delta}{\varepsilon}\sqrt{r+\lambda}\ln(\frac{r+\lambda}{\delta})\log(\frac{T}{\beta})\right)$,

where $\lambda=O\left(\frac{1}{\varepsilon}\log(T)\log\left(\frac{T}{\beta}\right)\right)$ is the error of the counter of Theorem 2.10.

The privacy guarantees of ChallengeAT are defined via a game with an adversary $\mathcal{B}$ whose goal is to guess a secret bit $b$. At the beginning of the game, the adversary chooses two neighboring datasets $S_{0},S_{1}$, and ChallengeAT is instantiated with $S_{b}$. Then throughout the game the adversary specifies queries $f_{i}$ and observes the output of ChallengeAT on these queries. At some special round $i^{*}$, chosen by the adversary, the adversary specifies two queries $f_{i^{*}}^{0},f_{i^{*}}^{1}$, where only $f_{i^{*}}^{b}$ is fed into ChallengeAT. In round $i^{*}$ the adversary does not get to see the answer of ChallengeAT on $f_{i^{*}}^{b}$ (otherwise it could easily learn the bit $b$ since $f_{i^{*}}^{0},f_{i^{*}}^{1}$ may be very different). The formal statement of this game is given in algorithm $\texttt{ChallengeAT-Game}_{\mathcal{B}}$.

Theorem 4.3.

For every adversary $\mathcal{B}$ it holds that $\texttt{ChallengeAT-Game}_{\mathcal{B}}$ is $\left(O(\varepsilon),O(\delta)\right)$-DP w.r.t. the bit $b$ (the input of the game).

Proof.

Fix an adversary $\mathcal{B}$. Let CATG denote the algorithm $\texttt{ChallengeAT-Game}_{\mathcal{B}}$ with this fixed $\mathcal{B}$. Consider a variant of algorithm CATG, which we call $\texttt{CATG-noCount}$, defined as follows. During the challenge round $i$, inside the call to ChallengeAT, instead of feeding $\sigma_{i}$ to the PrivateCounter we simply feed it 0 (in Step 3d of ChallengeAT).

By the privacy properties of PrivateCounter (Theorem 2.10), for every $b\in\{0,1\}$ we have that

[TABLE]

so it suffices to show that $\texttt{CATG-noCount}$ is DP (w.r.t. $b$). Now observe that the execution of PrivateCounter during the execution of $\texttt{CATG-noCount}$ can be simulated from the view of the adversary $\mathcal{B}$ (the only bit that ChallengeAT feeds the counter which is not in the view of the adversary is the one of the challenge round, which we replaced by zero in $\texttt{CATG-noCount}$). Hence, we can generate the view of $\mathcal{B}$ in algorithm CATG by interacting with AboveThreshold instead of with ChallengeAT. This is captured by algorithm $\texttt{CATG-AboveThreshold}$.

This algorithm is almost identical to $\texttt{CATG-noCount}$, except for the fact that AboveThreshold might halt the execution itself (even without the halting condition on the outcome of PrivateCounter). However, by the utility guarantees of PrivateCounter, with probability at least $1-\delta$ it never errs by more than $\lambda$, in which case algorithm AboveThreshold never halts prematurely. Hence, for every bit $b\in\{0,1\}$ we have that

[TABLE]

So it suffices to show that $\texttt{CATG-AboveThreshold}$ is DP (w.r.t. its input bit $b$). This almost follows directly from the privacy guarantees of AboveThreshold, since $\texttt{CATG-AboveThreshold}$ interacts only with this algorithm, except for the fact that during the challenge round $i$ the adversary $\mathcal{B}$ specifies two queries (and only one of them is fed into AboveThreshold). To bridge this gap, we consider one more (and final) modification to the algorithm, called $\widehat{\texttt{CATG}}\texttt{-AboveThreshold}$. This algorithm is identical to $\texttt{CATG-AboveThreshold}$, except that in Step 4c we do not feed $f_{i}^{b}$ to AboveThreshold if $c_{i}=1$. That is, during the challenge round we do not interact with AboveThreshold.

Now, by the privacy properties of AboveThreshold we have that $\widehat{\texttt{CATG}}\texttt{-AboveThreshold}$ is DP (w.r.t. its input bit $b$). Furthermore, when algorithm AboveThreshold does not halt prematurely, we have that $\widehat{\texttt{CATG}}\texttt{-AboveThreshold}$ is identical to $\texttt{CATG-AboveThreshold}$. Therefore, for every bit $b\in\{0,1\}$ we have

[TABLE]

Overall we get that

[TABLE]

∎

4.1 Algorithm POP

We are now ready to present our private online prediction algorithm. Consider algorithm POP (see Algorithm 9).

We now analyze the privacy guarantees of POP.

Theorem 4.4.

Algorithm POP is $\left(O(\varepsilon),O(\delta)\right)$-Challenge-DP. That is, for every adversary $\mathcal{B}$ it holds that $\texttt{OnlineGame}_{\texttt{POP},\mathcal{B}}$ is $\left(O(\varepsilon),O(\delta)\right)$-DP w.r.t. the bit $b$ (the input of the game).

Proof.

Let $\mathcal{B}$ be an adversary that plays in OnlineGame against POP, posing at most one challenge. That is, at one time step $i$, the adversary specifies two inputs $(x_{i}^{0},y_{i}^{0}),(x^{1}_{i},y^{1}_{i})$, algorithm POP processes $(x_{i}^{b},y_{i}^{b})$, and the adversary does not see the prediction $\hat{y}_{i}$ at this time step. We need to show that the view of the adversary is DP w.r.t. the bit $b$. To show this, we observe that the view of $\mathcal{B}$ can be generated (up to a small statistical distance of $\delta$) by interacting with ChallengeAT as in the game ChallengeAT-Game. Formally, consider the following adversary $\hat{\mathcal{B}}$ that simulates $\mathcal{B}$ while interacting with ChallengeAT instead of POP.

As $\hat{\mathcal{B}}$ only interacts with ChallengeAT, its view at the end of the execution (which includes the view of the simulated $\mathcal{B}$ ) is DP w.r.t. the bit $b$ . Furthermore, the view of the simulated $\mathcal{B}$ generated in this process is almost identical to the view of $\mathcal{B}$ had it interacted directly with POP. Specifically, the only possible difference is that the computation of $\hat{y}_{i}$ in Step 3(e)ii of $\hat{\mathcal{B}}$ might not be well-defined. But this does not happen when ChallengeAT maintains correctness, which holds with probability at least $1-\delta$ .

Overall, letting $\texttt{ChallengeAT-Game}_{\hat{\mathcal{B}}|_{\mathcal{B}}}$ denote the view of the simulated $\mathcal{B}$ at the end of the interaction of $\hat{\mathcal{B}}$ with ChallengeAT, we have that

[TABLE]

∎

We proceed with the utility guarantees of POP. See Appendix C for an extension to the agnostic setting.

Theorem 4.5.

When executed with a learner $\mathcal{A}$ that makes at most $d$ mistakes and with parameters $k=\tilde{O}\left(\frac{d}{\varepsilon^{2}}\log^{2}(\frac{1}{\delta})\log^{2}(\frac{T}{\beta})\right)$ and $r=O\left(dk+\ln\left(\frac{1}{\beta}\right)\right)$ , then with probability at least $(1-\beta)$ the number of mistakes made by algorithm POP is bounded by $\tilde{O}\left(\frac{d^{2}}{\varepsilon^{2}}\log^{2}(\frac{1}{\delta})\log^{2}(\frac{T}{\beta})\right).$

Proof.

By Theorem 4.2, with probability $(1-\beta)$ over the internal coins of ChallengeAT, for every input sequence, its answers are accurate up to error of

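$${\rm error}_{\rm CAT}=O\left(\frac{\Delta}{\varepsilon}\sqrt{r+\lambda}\ln\left(\frac{r+\lambda}{\delta}\right)\log\left(\frac{T}{\beta}\right)\right),$$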

where in our case, the sensitivity $\Delta$ is $1$, and the error of the counter $\lambda$ is at most $O\left(\frac{1}{\varepsilon}\log(T)\log\left(\frac{T}{\beta}\right)\right)$ by Theorem 2.10. We continue with the proof assuming that this event occurs. Furthermore, we set $k=\Omega\left({\rm error}_{\rm CAT}\right)$, large enough such that if fewer than $\frac{1}{5}$ of the experts disagree with the other experts, then algorithm POP returns the majority vote with probability 1.

Consider the execution of algorithm POP and define $1/5$-Err to be the random variable that counts the number of time steps in which at least $1/5$th of the experts make an error. That is,

[TABLE]

We also define the random variable

[TABLE]

That is, expertAdvance counts the number of time steps in which the random expert we choose (the $\ell_{i}$th expert) errs. Note that the $\ell_{i}$th expert is the expert that gets the “true” label $y_{i}$ as feedback. As we run $k$ experts, and as each of them is guaranteed to make at most $d$ mistakes, we get that

[TABLE]

We now show that with high probability 1/5-Err is not much larger than ${\rm expertAdvance}$ . Let $i$ be a time step in which at least $1/5$ fraction of the experts err. As the choice of $\ell_{i}$ (the expert we update) is random, then with probability at least $\frac{1}{5}$ the chosen expert also errs. It is therefore unlikely that 1/5-Err is much larger than ${\rm expertAdvance}$ , which is bounded by $kd$ . Specifically, by standard concentration arguments (see Appendix B for the precise version we use) it holds that

[TABLE]
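The intuition behind this concentration step can be checked with a toy simulation. The sketch below assumes the worst case in which, in every counted round, the randomly chosen expert errs with probability exactly $1/5$, and it counts how many such rounds pass before the total mistake budget (playing the role of $kd$) is used up. The function name and parameters are illustrative and not part of POP.

```python
import random

def rounds_until_budget(budget, p=0.2, seed=0):
    """Count rounds until `budget` expert mistakes occur, each round erring w.p. p."""
    rng = random.Random(seed)
    advances = 0   # mistakes charged to the randomly chosen expert so far
    rounds = 0     # rounds in which at least a p-fraction of the experts err
    while advances < budget:
        rounds += 1
        if rng.random() < p:
            advances += 1
    return rounds
```

In expectation the count is `budget / p`, that is $5kd$ for $p=1/5$, so a cap of the form $18dk$ plus lower-order terms leaves comfortable slack.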

Note that when at least $1/5$ of the experts disagree with other experts then at least $1/5$ of the experts err. It follows that 1/5-Err upper bounds the number of times in which algorithm ChallengeAT returns an “above threshold” answer. Hence, by setting $r>18dk+18+\ln\left(\frac{1}{\beta}\right)$ we ensure that w.h.p. algorithm ChallengeAT does not halt prematurely (and hence POP does not either).

Furthermore, our algorithm errs either when there is a large disagreement between the experts or when all experts err. It follows that 1/5-Err also upper bounds the number of times in which our algorithm errs.

Overall, by setting $r=O\left(dk+\ln\left(\frac{1}{\beta}\right)\right)$ we ensure that POP does not halt prematurely, and by setting $k=O\left(\frac{\Delta}{\varepsilon}\sqrt{r+\lambda}\ln(\frac{r+\lambda}{\delta})\log(\frac{T}{\beta})\right)$ we ensure that POP does not err too many times throughout the execution. Combining the requirement on $r$ and on $k$ , it suffices to take

[TABLE]

in which case algorithm POP makes at most $\tilde{O}\left(\frac{d^{2}}{\varepsilon^{2}}\log^{2}(\frac{1}{\delta})\log^{2}(\frac{T}{\beta})\right)$ mistakes with high probability. ∎
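To see how the two requirements combine, here is a rough worked calculation. It is a sketch that takes $\Delta=1$, ignores constants, assumes $r+\lambda$ is dominated by $dk$, and abbreviates the logarithmic factors by $L$ (a symbol introduced only for this illustration).

```latex
\begin{align*}
L &:= \ln\!\left(\tfrac{r+\lambda}{\delta}\right)\log\!\left(\tfrac{T}{\beta}\right), \qquad r\approx dk,\\
k \gtrsim \frac{1}{\varepsilon}\sqrt{dk}\,L
 \;&\Longleftrightarrow\; \sqrt{k}\gtrsim \frac{\sqrt{d}\,L}{\varepsilon}
 \;\Longleftrightarrow\; k\gtrsim \frac{d\,L^{2}}{\varepsilon^{2}},\\
\#\text{mistakes} \;\leq\; 1/5\text{-Err} \;&\lesssim\; dk \;\approx\; \frac{d^{2}L^{2}}{\varepsilon^{2}} .
\end{align*}
```

Up to logarithmic factors this matches the choice of $k$ and the mistake bound stated in Theorem 4.5.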

Appendix A General Variant of challenge-DP

Definition A.1.

Consider an algorithm $\mathcal{M}$ that, in each phase $i\in[T]$ , conducts an arbitrary interaction with the $i$ th user. We say that algorithm $\mathcal{M}$ is $(\varepsilon,\delta)$ -challenge differentially private if for any adversary $\mathcal{B}$ we have that $\texttt{GeneralGame}_{\mathcal{M},\mathcal{B},T}$ , defined below, is $(\varepsilon,\delta)$ -differentially private (w.r.t. its input bit $b$ ).

Appendix B A Coin Flipping Game

Consider Algorithm 12, which specifies an $m$-round “coin flipping game” against an adversary $\mathcal{B}$. In this game, the adversary adaptively chooses the biases of the coins we flip. In every flip, the adversary might gain a reward or incur a “budget loss”. The adversary aims to maximize the rewards it collects before its budget runs out.

The next theorem states that no adversary can obtain reward much larger than $k$ in this game. Intuitively, this holds because in every time step $i$, the probability of $X_{i}=2$ (a budget loss) is not much smaller than the probability of a reward, so (w.h.p.) it is very unlikely that the number of rewards would be much larger than $k$.

Theorem B.1 ([Gupta et al., 2010, Kaplan et al., 2021]).

For every adversary’s strategy, every $k\geq 0$ , every $m\in\N$ , and every $\lambda\in\R$ , we have

[TABLE]

Appendix C Extension to the Agnostic Case

In this section we extend the analysis of POP to the agnostic setting. We use the tilde-notation to hide logarithmic factors in $T,\frac{1}{\delta},\frac{1}{\beta},\frac{1}{\varepsilon}$ .

Theorem C.1 ([Ben-David et al., 2009]).

For any hypothesis class $H$ and scalar $M^{*}\geq 0$ there exists an online learning algorithm such that for any sequence $((x_{1},y_{1}),\dots,(x_{T},y_{T}))$ satisfying $\min\limits_{h\in H}\sum_{i=1}^{T}|h(x_{i})-y_{i}|\leq M^{*}$ the predictions $\hat{y}_{1},\dots,\hat{y}_{T}$ given by the algorithm satisfy

[TABLE]

Definition C.2.

For parameters $u<w$ , let $\texttt{POP}_{[u,w]}$ denote a variant of POP in which we halt the execution after the $v$ th time in which we err, for some arbitrary value $u\leq v\leq w$ . (Note that the execution might halt even before that, by the halting condition of POP itself.) This could be done while preserving privacy (for appropriate values of $u<w$ ) by using the counter of Theorem 2.10 for privately counting the number of mistakes.

Lemma C.3.

Let $H$ be a hypothesis class with $d=\operatorname{\rm Ldim}(H)$ , and let $\mathcal{A}$ denote the non-private algorithm from Theorem C.1 with $M^{*}=d\ln(T)$ . Denote $k=\tilde{\Theta}\left(\frac{d^{2}}{\varepsilon}\right)$ , $r=u=\Theta\left(kd\ln(T)\right)$ , and $w=2u$ . Consider executing $\texttt{POP}_{[u,w]}$ with $\mathcal{A}$ and with parameters $k,r$ on an adaptively chosen sequence of inputs $(x_{1},y_{1}),\dots,(x_{i^{*}},y_{i^{*}})$ , where $i^{*}\leq T$ denotes the time at which $\texttt{POP}_{[u,w]}$ halts. Then, with probability at least $(1-\beta)$ it holds that

[TABLE]

Proof sketch.

Similarly to the proof of Theorem 4.5, we set $k=\tilde{\Omega}\left(\frac{d^{2}}{\varepsilon}\right)$, and assume that if fewer than $\frac{1}{5}$ of the experts disagree with the other experts, then algorithm $\texttt{POP}_{[u,w]}$ returns the majority vote with probability 1.

Let $1/5$ -Err denote the random variable that counts the number of time steps in which at least $1/5$ th of the experts make an error. As in the proof of Theorem 4.5, $1/5$ -Err upper bounds both the number of mistakes made by $\texttt{POP}_{[u,w]}$ , which we denote by ${\rm OurError}$ , as well as the number of times in which algorithm ChallengeAT returns an “above threshold” answer, which we denote by ${\rm NumTop}$ . By Theorem 4.2, we know that (w.h.p.) ${\rm NumTop}\geq r$ . Also let ${\rm WorstExpert}$ denote the largest number of mistakes made by a single expert.

Consider the time $i^{*}$ at which $\texttt{POP}_{[u,w]}$ halts. If it halts because $u\leq v\leq w$ mistakes have been made, then

[TABLE]

Alternatively, if $\texttt{POP}_{[u,w]}$ halts after $r$ “above threshold” answers, then

[TABLE]

In any case, when $\texttt{POP}_{[u,w]}$ halts it holds that at least one expert made at least $\Omega\left(d\ln(T)\right)$ mistakes. Therefore, by Theorem C.1, we have that $\mathop{\rm{OPT}}\nolimits_{i^{*}}\geq d\ln(T)$.

∎

Theorem C.4.

Let $H$ be a hypothesis class with $\operatorname{\rm Ldim}(H)=d$ . There exists an $(\varepsilon,\delta)$ -Challenge-DP online learning algorithm providing the following guarantee. When executed on an adaptively chosen sequence of inputs $(x_{1},y_{1}),\dots,(x_{T},y_{T})$ , then the algorithm makes at most $\tilde{O}\left(\frac{d\cdot\mathop{\rm{OPT}}\nolimits}{\varepsilon^{2}}+\frac{d^{2}}{\varepsilon^{2}}\right)$ mistakes (w.h.p.), where

[TABLE]

Proof sketch.

This is obtained by repeatedly re-running $\texttt{POP}_{[u,w]}$ , with the parameter setting specified in Lemma C.3. We refer to the time span of every single execution of $\texttt{POP}_{[u,w]}$ as a phase.

By construction, in every phase, $\texttt{POP}_{[u,w]}$ makes at most $w=\tilde{\Theta}(kd)$ mistakes. By Lemma C.3 every hypothesis in $H$ makes at least $d\cdot\ln(T)$ mistakes in this phase. Therefore, there could be at most $\tilde{O}\left(\max\left\{1\,,\,\frac{\mathop{\rm{OPT}}\nolimits}{d}\right\}\right)$ phases, during which we incur a total of at most $\tilde{O}\left(\frac{d\cdot\mathop{\rm{OPT}}\nolimits}{\varepsilon^{2}}+\frac{d^{2}}{\varepsilon^{2}}\right)$ mistakes. ∎
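The phase-counting arithmetic can be written out as follows. This is a sketch that keeps $k$ symbolic in the first line. The second line takes $k$ of order $d/\varepsilon^{2}$ up to logarithmic factors, the value used in Theorem 4.5, which is an assumption made for this illustration.

```latex
\begin{align*}
\#\text{mistakes}
 &\leq (\#\text{phases})\cdot w
 = \tilde{O}\!\left(\max\left\{1,\tfrac{\mathrm{OPT}}{d}\right\}\right)\cdot\tilde{\Theta}(kd)
 = \tilde{O}\!\left(kd+k\cdot\mathrm{OPT}\right),\\
k=\tilde{\Theta}\!\left(\tfrac{d}{\varepsilon^{2}}\right)
 \;&\Longrightarrow\;
 \#\text{mistakes}=\tilde{O}\!\left(\frac{d^{2}}{\varepsilon^{2}}+\frac{d\cdot\mathrm{OPT}}{\varepsilon^{2}}\right).
\end{align*}
```

With this value of $k$ the total agrees term by term with the bound stated in Theorem C.4.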

Bibliography


Agarwal and Singh [2017] Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 32–40. PMLR, 06–11 Aug 2017.
Asi et al. [2022] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private online prediction from experts: Separations and faster rates. CoRR, abs/2210.13537, 2022.
Ben-David et al. [2009] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, 2009.
Bun et al. [2017] Mark Bun, Thomas Steinke, and Jonathan Ullman. Make up your mind: The price of online queries in differential privacy. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1306–1325. SIAM, 2017.
Chan et al. [2010] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. In ICALP (2), volume 6199 of Lecture Notes in Computer Science, pages 405–417. Springer, 2010.
Dwork and Roth [2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
Dwork et al. [2009] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil P. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In STOC, pages 381–390, 2009.
