Don't Look at the Data! How Differential Privacy Reconfigures the Practices of Data Science

Jayshree Sarathy; Sophia Song; Audrey Haque; Tania Schlatter; Salil Vadhan

arXiv:2302.11775·cs.HC·February 24, 2023


TL;DR

This paper explores how differential privacy impacts data science practices through interviews, revealing benefits for data access but also challenges and ethical questions in workflow integration.

Contribution

It provides empirical insights into data practitioners' perceptions of differential privacy and offers suggestions for better integration into data science workflows.

Findings

01

DP enables wider access to sensitive data

02

DP introduces challenges at all data science stages

03

Ethical and governance issues emerge with DP use


Tables

Table 1. Summary of total participant sample.

Participant	Role	Background
D1	Depositor	Researcher
A2	Analyst	Researcher
A3	Analyst	Researcher
D4	Depositor	Researcher
D5	Depositor	Researcher
A6	Analyst	Researcher
D7	Depositor	Researcher
A8	Analyst	Researcher
A9	Analyst	Researcher
A10	Analyst	Data archive administrator
D11	Depositor	Data archive administrator
D12	Depositor	Data archive administrator
D13	Depositor	Data archive administrator
A14	Analyst	Data archive administrator
A15	Analyst	Data archive administrator
A16	Analyst	Data archive administrator
D17	Depositor	Data archive administrator
D18	Depositor	Data archive administrator
A19	Analyst	Data archive administrator

Table 2. Summary of research questions, main findings, and implications regarding the use of DP by data practitioners for providing open access to sensitive datasets.

Research Question	Finding	Implications
What barriers do non-experts in DP face when using DP to share or analyze sensitive datasets? (RQ1)	Understanding the reasoning behind or implications of making choices (Sec 5.1.1)	Design and research around DP data analysis tools (Sec 6.4): (1) Provide more explanations behind selecting parameters and making decisions (2) Create workflows around trust and safety for both data practitioners and data subjects (3) Develop features for automated or depositor-led data contextualization
	Conducting analyses without access to raw data (Sec 5.1.2)
	Assuming new risks and responsibilities (Sec 5.1.3)
	Integrating DP into data analysis pipelines (Sec 5.1.4)
What do data practitioners perceive to be the potential utility of DP for expanding access of sensitive data to the public, facilitating exploratory data analysis, and enabling replication? (RQ2)
	Beneficial for wider access to the public (Sec 5.2.1)
	Poses challenges for exploratory analysis (Sec 5.2.2)
	Will not necessarily enable better replication of scientific studies (Sec 5.2.3)
	Requires training, expertise, and governance (Sec 5.2.4)
		Differentially private data science (Sec 6.3): (1) Information flows from data depositors to analysts (2) Guidance from DP experts (3) Context-specific education (4) Governance of privacy-loss parameters
What changes need to be made in the data science workflow to overcome the barriers from RQ1 and achieve benefits from RQ2? (RQ3)	DP changes every aspect of the data analysis workflow. (Sec 6.1)
	Socializing users around the constraints of DP raises questions around ethical and epistemic implications (Sec 6.2)

Table 3. Codebook that resulted from our thematic analysis of contextual inquiry and interviews with data practitioners.

Theme / Code
Comprehension	Making decisions
Understood privacy-loss budget	Difficult to check assumptions
Grasped privacy vs. accuracy tradeoff	Did not know variable coding
Did not comprehend purpose of metadata parameters	Wanted to perform sanity checks
Asked for reading on parameters	Wanted to visualize data distribution
Misconceptions about how parameters affect privacy	Misremembered variable details
Conflated privacy and accuracy parameters	Misinterpreted error guarantee
Maintained default values	Confused about scale of noise
Questioned technical parameters	Satisfied with outputs
Asked to see algorithm code
Risks	Pipeline
Worried about being held liable	Asked about pre-processing
Lack of confidence in own choices	Wanted graphical workflow
Hesitant to validate dataset	Wanted integration with scripts
Unsure about making privacy decisions	Disliked switching workflows
Cognizant of limited privacy-loss budget	Wanted to export statistics
	Asked about citing statistics
Public access	Exploration
Beneficial for public	Lack of contextualization
More efficient access	Preferred synthetic data
Communicating about datasets	Need to know result instead of searching
Usable for understanding data
Replication	Education
Invisible intermediate steps	Not suitable for novices
Lack of trust in findings	Need information outside of tool
Noise makes it hard to confirm	Want explanations about the “why”
	Workshops and trainings

Equations

$$\Pr[\mathcal{M}(d,\varepsilon,\delta,hp)\in E]\leq e^{\varepsilon}\cdot\Pr[\mathcal{M}(d^{\prime},\varepsilon,\delta,hp)\in E]+\delta$$


Full text

Don’t Look at the Data! How Differential Privacy Reconfigures the Practices of Data Science

Jayshree Sarathy

[email protected]

Harvard John A. Paulson School for Engineering and Applied Sciences and OpenDP, 150 Western Ave, Boston, MA 02134, USA

Sophia Song

UC Berkeley, Berkeley, CA, USA

Audrey Haque

Harvard John A. Paulson School for Engineering and Applied Sciences, Harvard Graduate School of Design, and OpenDP, 150 Western Ave, Boston, MA 02134, USA

Tania Schlatter

Harvard Institute for Quantitative Social Sciences and OpenDP, Cambridge, MA, USA

Salil Vadhan

Harvard John A. Paulson School for Engineering and Applied Sciences and OpenDP, 150 Western Ave, Boston, MA 02134, USA

(2023)

Abstract.

Across academia, government, and industry, data stewards are facing increasing pressure to make datasets more openly accessible for researchers while also protecting the privacy of data subjects. Differential privacy (DP) is one promising way to offer privacy along with open access, but further inquiry is needed into the tensions between DP and data science. In this study, we conduct interviews with 19 data practitioners who are non-experts in DP as they use a DP data analysis prototype to release privacy-preserving statistics about sensitive data, in order to understand perceptions, challenges, and opportunities around using DP. We find that while DP is promising for providing wider access to sensitive datasets, it also introduces challenges into every stage of the data science workflow. We identify ethics and governance questions that arise when socializing data scientists around new privacy constraints and offer suggestions to better integrate DP and data science.

privacy, utility, open access, data practitioners, data analysis

Published in: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), April 23–28, 2023, Hamburg, Germany. DOI: 10.1145/3544548.3580791. ISBN: 978-1-4503-9421-5/23/04. CCS concepts: Security and privacy → Social aspects of security and privacy; Security and privacy → Usability in security and privacy; Security and privacy → Privacy-preserving protocols.

1. Introduction

Researchers, government agencies, and companies are increasingly expected to share their datasets with other researchers and the public (Burwell et al., 2013; Gherghina and Katsanidou, 2013; Vlaeminck et al., 2015). Data repositories such as Dataverse (Crosas, 2011; King, 2007) and Dryad (White et al., 2008) exist to promote such data sharing by ingesting and preserving datasets to be shared in the long-term. Yet, many datasets contain sensitive information about individuals. Over the last two decades, increases in computational power and availability of data sources have enabled new threats to the privacy of sensitive datasets, and it has been shown that heuristic anonymization techniques, such as removing personally identifiable information or only releasing aggregate statistics, do not adequately protect privacy (Narayanan and Shmatikov, 2008; Dwork et al., 2017). Growing calls for open access along with growing threats to privacy mean that data stewards are caught in a bind: either they release data and potentially compromise privacy, or they must put in place restrictive and costly mechanisms before allowing researchers access to their datasets.

To ease this tension, many organizations are turning to formal privacy frameworks such as differential privacy (DP) (Dwork et al., 2006) to protect privacy while offering wider access to rich datasets. But even as DP has gained prominence through high-profile deployments at Google (Erlingsson et al., 2014), Apple (Greenberg, 2016), and the U.S. Census Bureau (Abowd, 2018; Machanavajjhala et al., 2008), these deployments have also illuminated the significant challenges in bringing privacy-preserving data science from theory to practice. Scholars have begun to highlight the tensions between DP and data science, such as differing conceptions of risk, and ways in which DP data analysis clashes with ingrained workflows and modes of interaction within statistical agencies (Drechsler, 2022; Oberski and Kreuter, 2020). However, these scholars also point out that DP may lead to better scientific research by protecting against p-hacking and introducing robustness into the data analysis process (Dwork and Lei, 2009; Dwork et al., 2015; Oberski and Kreuter, 2020). Others foreground the “scant attention given to socialization of [privacy] tools” (Gürses and Del Alamo, 2016) and epistemic disconnects about data exposed by DP (boyd and Sarathy, 2022) as obstacles for bringing privacy-preserving methods from the lab into the world. These works highlight critical gaps between theory and practice of privacy-preserving data science that are crucial to explore further.

In this work, we offer insight into opportunities and challenges of DP from the perspective of data analysts, depositors, and administrators. (See Section 2.3 for a description of these roles, and Section 4 for more details on our participant sample.) In the HCI community, studies have started to explore challenges for design, communication, and governance of privacy-preserving data analysis tools, but they have largely done so through user studies examining the perspectives of data subjects (Cummings et al., 2021; Xiong et al., 2020; Bullek et al., 2017; Smart et al., 2022) and developers of DP software (Agrawal et al., 2021). Less studied are the needs of data practitioners, such as data depositors, administrators, and analysts, who are non-experts in DP with varying technical backgrounds, and who are experienced in sharing and analyzing sensitive datasets, particularly in the context of research data repositories. In addition, we consider the utility of differentially private data analysis tools towards the broader aims often stated for the use of DP, including exploratory data analysis, replication of scientific studies, and wider access to the public (Gaboardi et al., 2016).

In particular, we consider the following research questions.

RQ1: What barriers do data practitioners who are non-experts in DP face when using DP to share or analyze sensitive datasets?
RQ2: What do data practitioners who are non-experts in DP perceive to be the potential utility of DP for expanding access of sensitive data to the public, facilitating exploratory data analysis, and enabling replication of scientific studies?
RQ3: What changes need to be made in the data science workflow in order to address the barriers and realize the benefits (from RQ1 and RQ2) of DP?

Our study consisted of semi-structured interviews with 19 data depositors, administrators, and analysts as they used a technical probe (Hutchinson et al., 2003) – a software prototype called DP Creator – to make statistical releases, in order to understand perceptions, challenges, and opportunities around differentially private data analysis. The findings from these interviews were supplemented and framed by our study team members’ expertise in the foundations of DP. Based on these sources, our study presents the following contributions:

•

Through interviews with these participants, we provide insight into challenges of applying DP to share or analyze datasets, such as: (1) understanding the reasoning behind and implications of decisions regarding privacy and utility, (2) conducting analysis without access to raw data, (3) assuming new risks and responsibilities, and (4) integrating DP with upstream and downstream data analysis pipelines. (Section 5.1)

•

Our interviews also highlight the potential utility of DP for advancing goals of safe, open access to sensitive datasets. Participants were optimistic about expanding access to the general public, but pointed out challenges for using DP for research purposes. In particular, they expressed that training and expertise is still required to guide decision-making, and the constraints of DP make it hard to use for exploratory data analysis and replication of studies. (Section 5.2)

•

We discuss how DP requires modifications at every stage of the data science workflow. We discuss ethical, epistemic, and governance questions raised by DP for data science as a mode of knowledge production. We provide suggestions for integrating DP and data science, including: (1) more information flow between depositors and analysts, (2) consultation and guidance from experts in DP, (3) context-specific education, (4) governance of privacy-loss parameters through trained administrators. (Sections 6.1-6.3)

•

Finally, although not directly validated by our study design, we offer suggestions for research and design of DP data analysis tools, including: (1) providing more explanation behind selecting parameters, (2) creating workflows around trust and safety for both data practitioners and data subjects, and (3) developing features for automated and depositor-led data contextualization. (Section 6.4)

The paper proceeds as follows. We begin in Section 2 by explaining DP, its details in practice, and the DP Creator prototype. In Section 3, we discuss related work on this topic. Next, in Section 4, we provide a description of methods, data collection, and analysis. In Section 5, we discuss themes from our interviews and contextual inquiry. Based on these findings, we conclude in Section 6 with a discussion about the ways in which the practices of data science itself must be reconfigured in order to comply with the constraints of DP, suggestions for more smoothly integrating DP and data science, and suggestions for design of DP data analysis tools.

2. Background

In this section, we provide background on differential privacy and the DP Creator prototype that we use as a technical probe (Hutchinson et al., 2003) in our study. Based on our understanding of the foundations of DP, we also describe the information that interfaces for DP data analysis, such as DP Creator, require from data practitioners.

2.1. Differential Privacy

Differential privacy (DP), introduced by Dwork, McSherry, Nissim, and Smith in 2006 (Dwork et al., 2006), is a mathematical definition of privacy that limits how much information a mechanism for making a statistical release reveals about any one individual in the dataset. The definition is parametrized by two quantities — $\varepsilon$ and $\delta$ — that denote the ‘privacy loss’ incurred by running a given set of analyses on the data. In order to satisfy a guarantee of small privacy loss, the mechanism must introduce carefully calibrated noise to any computation over the data.

We provide the formal definition of DP below. Let $\mathcal{D}$ be a data universe and $\mathcal{D}^{n}$ be the space of datasets of size $n$. Two datasets $d,d^{\prime}\in\mathcal{D}^{n}$ are neighboring, denoted $d\sim d^{\prime}$, if they differ on a single record. Let $\mathcal{H}$ be a hyperparameter space and $\mathcal{Y}$ be an output space.

Definition 2.1 (Dwork et al., 2006).

A randomized mechanism $\mathcal{M}:\mathcal{D}^{n}\times\mathbb{R}_{\geq 0}\times[0,1]\times\mathcal{H}\rightarrow\mathcal{Y}$ is $(\varepsilon,\delta)$-differentially private if for all datasets $d\sim d^{\prime}\in\mathcal{D}^{n}$, privacy-loss parameters $\varepsilon\geq 0$ and $\delta\in[0,1]$, hyperparameters $hp\in\mathcal{H}$, and events $E\subseteq\mathcal{Y}$,

$$\Pr[\mathcal{M}(d,\varepsilon,\delta,hp)\in E]\leq e^{\varepsilon}\cdot\Pr[\mathcal{M}(d^{\prime},\varepsilon,\delta,hp)\in E]+\delta$$

where all probabilities are taken over the random coins of $\mathcal{M}$ .

The mathematical formalization of DP accounts for current and future attacks, remains robust to arbitrary auxiliary information, and measures compositions of privacy loss over multiple data releases (Dwork et al., 2014). In addition, unlike heuristic approaches to privacy that rely on security by obscurity, DP enables third-party scrutiny of the algorithm, which makes it possible for a data analyst to take into account the noise introduced when performing inference and estimating uncertainty through confidence intervals.
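To make Definition 2.1 concrete, the following is a minimal sketch of one standard way to satisfy it for a counting query: the Laplace mechanism. It is not taken from the paper or from DP Creator; the function names (`laplace_noise`, `dp_count`, `max_density_ratio`) are hypothetical, and the sketch assumes $\delta=0$ and a query whose value changes by at most 1 between neighboring datasets. It also uses ordinary floating-point sampling, which is known to be unsafe for real deployments, so it should be read as an illustration only.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    # The difference of two i.i.d. exponentials with rate 1/scale is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count of the records that satisfy `predicate`.

    Changing a single record changes the count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives an (epsilon, 0)-DP release.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_density(y: float, center: float, scale: float) -> float:
    """Density of the mechanism's output at y when the true answer is `center`."""
    return math.exp(-abs(y - center) / scale) / (2.0 * scale)


def max_density_ratio(count_d: int, count_d_prime: int, epsilon: float, outputs) -> float:
    """Largest ratio of output densities between two neighboring datasets.

    Definition 2.1 with delta = 0 requires this ratio to be at most exp(epsilon).
    """
    scale = 1.0 / epsilon
    return max(
        laplace_density(y, count_d, scale) / laplace_density(y, count_d_prime, scale)
        for y in outputs
    )
```

Evaluating `max_density_ratio` on a grid of outputs for true counts that differ by 1 gives a numerical check of the inequality in the definition: the ratio should never exceed $e^{\varepsilon}$, up to floating-point rounding.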

DP has become a gold standard for measuring and controlling privacy loss for statistical releases on sensitive datasets. Disclosure methods that satisfy DP have been adopted by a variety of data-collection agencies and institutions, including Google (Erlingsson et al., 2014), Apple (Greenberg, 2016), Uber (Near, 2018), and the U.S. Census Bureau (Machanavajjhala et al., 2008; Abowd, 2018).

2.2. Applying DP in practice

Using DP requires data practitioners (i.e., data depositors and/or data analysts, as defined in Section 2.3) to provide information about the dataset or data domain, and to make choices about privacy and utility. Many conversations around using DP in practice focus on selecting the privacy-loss parameters ($\varepsilon$ and $\delta$), but in reality, setting these parameters is just one of several decisions that practitioners must make during the data analysis process. Based on our collective expertise in conducting research in the foundations of DP, designing DP tools, and working with data practitioners, we made a list of parameters that practitioners may need to choose within a DP data analysis interface. (We rely on this expertise to frame our main findings in this paper, i.e., Sections 5-6, which are primarily derived from our interviews with data practitioners.) These include the following:

•

Validating that the dataset is appropriate for DP, for example, by confirming that it contains moderately sensitive data about individuals. (Data that is not very sensitive can potentially be made available without using DP, and data that is highly sensitive may need access control mechanisms and additional review beyond just using DP.)

•

Providing information about how the data was sampled, such as whether it was a simple random sample from a larger population (which, when supplied with the size of this larger population, would amplify the privacy guarantee (Balle et al., 2018)), or whether a data-dependent sampling scheme was used (which might degrade the privacy guarantee (Bun et al., 2022)).

•

Selecting privacy-loss parameters, $\varepsilon$ and $\delta$; the smaller the parameters, the more privacy is retained in the release. Together, these parameters are also called a privacy-loss budget.

•

Selecting metadata parameters, such as ranges for numerical variables and categories for categorical variables. These parameters are necessary for limiting the amount of noise added to a statistic. It is important to set these inputs carefully for both utility and privacy: overly large ranges or extensive categories may lead to high variance in the noisy statistic, while overly narrow ranges or limited categories may lead to biased outputs. Crucially, ranges and categories should not be derived from the values in the dataset itself; data-dependent parameter selection constitutes a potential privacy risk. Rather, they should be set according to a codebook for the dataset or knowledge about the population independent from the dataset. For example, the range for an ‘age’ variable could be set to 0-110 based on general knowledge about human lifespans, but it should not be set to values informed by the private dataset itself, such as 5-96.

•

Allocating the privacy-loss budget amongst different statistics. Researchers typically run many analyses on a single dataset. Using the composition theorems in DP (Dwork et al., 2006; Kairouz et al., 2015; Murtagh and Vadhan, 2016), a software tool can analyze the privacy loss over multiple statistical releases. However, the data depositor may wish to distribute a global privacy-loss budget over many data analysts or releases, and the data analysts will need to distribute their allocated portion of the budget over multiple statistics.
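The role of the metadata parameters in the list above can be illustrated with a small sketch of a differentially private mean. This is an illustrative assumption, not DP Creator's implementation: the helper names are hypothetical, the dataset size is treated as public, and Laplace noise is used with $\delta=0$. Values are clamped to a range chosen independently of the data, which bounds how much one record can change the result.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def clamp(value: float, lower: float, upper: float) -> float:
    """Force a value into the declared range [lower, upper]."""
    return max(lower, min(upper, value))


def noise_scale(lower: float, upper: float, n: int, epsilon: float) -> float:
    """Laplace scale for a clamped mean over n records.

    Replacing one record moves the clamped sum by at most (upper - lower),
    so the mean has sensitivity (upper - lower) / n.
    """
    return (upper - lower) / (n * epsilon)


def clamping_bias(values, lower: float, upper: float) -> float:
    """Bias introduced by clamping alone, before any noise is added."""
    n = len(values)
    return sum(clamp(v, lower, upper) for v in values) / n - sum(values) / n


def dp_mean(values, lower: float, upper: float, epsilon: float, rng: random.Random) -> float:
    """Noisy mean of clamped values; the range must come from a codebook, not the data."""
    n = len(values)
    clamped_mean = sum(clamp(v, lower, upper) for v in values) / n
    return clamped_mean + laplace_noise(noise_scale(lower, upper, n, epsilon), rng)
```

Under these assumptions, widening the range (say, from 0-110 to 0-1000 for an age variable) leaves `clamping_bias` at zero but increases `noise_scale` proportionally, while an overly narrow range shrinks the noise but makes `clamping_bias` nonzero. This is the bias-variance tradeoff described in the metadata bullet.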

The need to make these choices is well-known to experts in DP, but often gets lost in the discussions around deployments. As indicated by our findings in Section 5 and the discussion in Section 6, these decisions pose challenges for non-experts in DP throughout the data analysis process.
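As a rough sketch of the budget allocation choice, the code below splits a global budget using basic composition, under which the $\varepsilon$ and $\delta$ values of individual releases simply add up. The tighter composition theorems cited above are not implemented here, and the function names, statistic names, and weights are hypothetical.

```python
import math


def reserve_for_future(epsilon: float, delta: float, reserved_fraction: float):
    """Split a global budget into a part spent now and a part kept for later analysts."""
    if not 0.0 <= reserved_fraction < 1.0:
        raise ValueError("reserved_fraction must be in [0, 1)")
    now = (epsilon * (1.0 - reserved_fraction), delta * (1.0 - reserved_fraction))
    later = (epsilon * reserved_fraction, delta * reserved_fraction)
    return now, later


def split_budget(epsilon: float, delta: float, weights: dict) -> dict:
    """Divide (epsilon, delta) across statistics in proportion to their weights."""
    total_weight = sum(weights.values())
    return {
        name: (epsilon * w / total_weight, delta * w / total_weight)
        for name, w in weights.items()
    }


def total_privacy_loss(shares: dict):
    """Basic composition: the overall loss is the sum of the per-statistic losses."""
    eps_total = math.fsum(eps for eps, _ in shares.values())
    delta_total = math.fsum(dlt for _, dlt in shares.values())
    return eps_total, delta_total
```

For instance, a depositor with a global budget of $\varepsilon=1.0$ and $\delta=10^{-6}$ could reserve half for future analysts and give a statistic of particular interest twice the weight of the others. The per-statistic shares then sum back to the portion spent in the initial release.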

2.3. DP Creator

Our study involves observing participants as they use a prototype of a differentially private data analysis tool called DP Creator, which is part of the OpenDP open-source software project (Gaboardi et al., 2020). DP Creator is a software tool that aims to make it easier for people who are not experts in privacy, statistics, or computer science to reap the benefits of DP when sharing their data. (Note that our study team is affiliated with OpenDP, and some of the authors were involved in designing the DP Creator tool.) While other DP systems are designed to optimize utility for specific data releases or specific algorithms, DP Creator seeks to be a general-purpose tool for privacy-preserving data science that is compatible with existing workflows in social science research. (DP Creator evolved from a tool called PSI: a Private data Sharing Interface, the white paper for which (Gaboardi et al., 2016) provides more details on the motivations and design of DP Creator. See also Murtagh et al. (Murtagh et al., 2018) for an initial usability study of PSI.) The tool has three interfaces according to three different user roles: data depositor, data analyst, and data archive administrator.

•

The data depositor interface is for a researcher who has collected sensitive data about individuals and who would like to analyze and share summary statistics about this data using the tool. The depositor also provides metadata information to facilitate future DP analyses of the dataset beyond her initial release.

The data depositor uploads her data to the tool directly or through an online data repository such as Dataverse (Crosas, 2011). If possible, she should also upload a codebook or schema for the dataset. She must first validate that the dataset is appropriate for applying DP (as shown in Figure 1(b)): for example, it should contain moderately sensitive information about individuals, and the dataset should be organized such that each row corresponds to an individual (because DP provides an individual-level privacy-loss guarantee). Then, she must answer whether the dataset is a simple random sample from a larger population, and if so, input the size of the larger population. (The DP Creator prototype only asks users about simple random sampling; it has not yet incorporated recent work on data-dependent sampling schemes and their effects on the privacy guarantee (Bun et al., 2022).) Next, she must set the global privacy-loss parameters for the dataset. The prototype currently allows the depositor to select a “data security classification” level for the dataset according to the organization’s research data guidelines (for example, see https://security.harvard.edu/data-security-levels-research-data-examples), which it then matches to pre-set values of $\varepsilon$ and $\delta$. (These parameters were decided and hard-coded by the designers of DP Creator. We find these parameters reasonable, but it is important to note that there is currently no consensus around how to select $\varepsilon$ and $\delta$ parameters, nor is it a widespread practice to set these parameters according to data security classification levels.) For example, if the depositor selects “Information that could cause risk of material harm to individuals or the university if disclosed,” the prototype sets $\varepsilon=1.0$ and $\delta=10^{-6}$. For an in-depth discussion around selecting privacy-loss parameters, see Dwork et al. (Dwork et al., 2019). The depositor can also set or adjust these parameters manually. (The prototype offers guidelines for setting parameters manually, such as choosing $\varepsilon$ to be between $0.1$ and $1.0$ and $\delta$ to be smaller than $10^{-6}$.) The depositor is prompted to leave a portion of the global privacy-loss budget for future analysts who would like to submit queries to this dataset. Next, the tool will present the variables in the dataset, their labels and types, and any metadata information (such as ranges, categories, or descriptions) available in the codebook (see Figure 1(c)). The data depositor must adjust these ranges and categories according to her preferences and guidelines from the prototype; for example, the prototype explains that ranges that are too large will increase variance of the noisy statistic, while ranges that are too small will lead to biased outputs. (Note that the prototype does not offer a way for the depositor to select metadata parameters based on a well-defined calculation of the bias-variance tradeoff, but simply states that a tradeoff of this form exists.) The prototype also reminds the depositor that these metadata parameters should not be selected based on information from the dataset itself. Next, the depositor chooses which statistics to compute over subsets of the variables (see Figure 1(d)). Once the depositor has chosen all of her desired statistics, she should adjust how much of the privacy-loss budget for this initial release is allocated to each statistic based on how this affects the error on each statistic (spending more budget leads to less error on that statistic). Finally, the depositor confirms that she is ready to submit her statistics and is able to share the output of the DP analysis.

•

The data analyst interface is for a researcher or member of the public who does not have access to the raw data, but who would like to receive summary statistics about the data (perhaps to decide whether or not to go through the lengthy process of applying for access to the raw data).

First, the data analyst must select the dataset she wishes to explore using DP. The prototype then presents the variables in the dataset, their descriptions, and any metadata parameters (such as ranges, categories, or descriptions) as inputted by the data depositor. The data analyst may also be able to refer to a codebook for the dataset, if this has been uploaded by the depositor. The data analyst must select the variables she wishes to work with, and can adjust the ranges and categories according to her preferences (again, the prototype explains to the analyst that a general bias-variance tradeoff exists with respect to the selection of these parameters). The privacy-loss parameters allocated to this specific analysis are already set by the data depositor; the data analyst cannot change these values. Then, the analyst chooses which statistics to compute over subsets of the variables. Once the analyst has chosen all of her desired statistics, she should adjust how much of the privacy-loss budget for this analysis is allocated to each statistic based on how this affects the error on each statistic (again, spending more budget leads to less error on that statistic). Finally, the analyst confirms that she is ready to submit her statistics and receives the output of the DP analysis.

•

The data administrator interface is for a trained data curator or research data librarian who manages the different datasets and access policies for a given data archive. The data administrator, in consultation with data depositors, may be tasked with approving or denying requests from researchers to either access DP summary statistics released by the depositor, or to conduct their own DP analyses, on sensitive datasets. The administrator may also manage the allocation of the privacy-loss budget amongst different analysts.
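Two quantities from the workflows above can be made concrete with a short sketch. The first function applies a standard amplification-by-subsampling bound, in the style of Balle et al. (2018), to show how declaring a simple random sample from a larger population can strengthen the effective $\varepsilon$; this assumes that membership in the sample is itself secret. The second shows why spending more budget on a statistic reduces its error, assuming Laplace noise. Whether DP Creator computes either quantity in this way is an assumption, and both function names are hypothetical.

```python
import math


def amplified_epsilon(epsilon: float, sample_size: int, population_size: int) -> float:
    """Effective epsilon when the data is a secret simple random sample.

    Uses the bound epsilon' = ln(1 + q * (exp(epsilon) - 1)) with sampling
    rate q = sample_size / population_size.
    """
    if not 0 < sample_size <= population_size:
        raise ValueError("need 0 < sample_size <= population_size")
    q = sample_size / population_size
    return math.log1p(q * math.expm1(epsilon))


def laplace_error_bound(sensitivity: float, epsilon: float, alpha: float = 0.05) -> float:
    """Error t such that Laplace noise exceeds t in magnitude with probability alpha.

    For Laplace noise with scale b = sensitivity / epsilon, P(|noise| > t) = exp(-t / b),
    so t = b * ln(1 / alpha).
    """
    scale = sensitivity / epsilon
    return scale * math.log(1.0 / alpha)
```

In this sketch, halving the share of $\varepsilon$ allocated to a statistic doubles its error bound at the same confidence level, which is the tradeoff depositors and analysts face when distributing the budget across statistics.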

In this study, we consider the first two interfaces, as these are the ones available in the DP Creator prototype. We also consider the three main use cases that the prototype aims to support: exploratory data analysis, replication of scientific studies, and wider access of sensitive datasets for the public (Gaboardi et al., 2016). The current version of the prototype can be accessed at https://demo-dataverse.dpcreator.org/ and a tutorial for using it can be found at https://docs.google.com/document/d/e/2PACX-1vRlZ2IgigIhl4oz_uOakQPxovzlrmFkbD-x_9RUO31dC0eRq2wCt_vN2Go0_9LTRd67srjgy04CfPVk/pub (note that this version may have minor changes compared to the version of the prototype used in our study).

3. Related Work

We situate our work within three areas in privacy and HCI literature: privacy communication, defining users of privacy tools, and translating ethical data science from theory to practice.

3.1. Privacy Communication

Several recent works have explored communication of DP guarantees to data subjects who may be faced with the decision of whether or not to have their data collected. Bullek et al. (Bullek et al., 2017) explore how data subjects understand and interact with privacy-loss parameters for survey responses. After being given a sensitive question, participants were asked to choose a level of perturbation (i.e., privacy) to add to their response. Interestingly, some participants chose the lowest level of privacy for their response, revealing the unexpected behavior that can emerge when participants do not understand the norms of practice that are implicit within a given privacy technique. Xiong et al. (Xiong et al., 2020) explore data disclosure decisions from data subjects, elucidating reasons that users choose to share their data. They find that more explanation of the implications of DP yields further sharing of sensitive data; however, they also find that even when users claim to understand the explanations, objective measurements show that the rates of comprehension are quite low. Cummings et al. (Cummings et al., 2021) also examine the dynamics of comprehension and data sharing from data subjects. They find that users care about different kinds of information leakage, and that the manner in which DP is communicated sets privacy expectations accordingly. Most recently, Smart et al. (Smart et al., 2022) investigate explanations of DP that hide important information such as privacy-loss parameters. Interestingly, they find that explanations have little effect on individuals’ willingness to share, which is determined even before they learn about the privacy protections at hand.

These works offer important lessons for communication of DP guarantees to end users. They point out several ways in which users can misunderstand the common explanations used regarding DP, and they caution that we must be intentional with the way explanations are conveyed, for such explanations can affect participants’ motivations for data sharing in subtle or unexpected ways. However, while these works consider communication for end users who are data subjects, our work takes end users to be DP practitioners who may not have the ability to “opt out” if they are unsure about the protections offered, as data subjects often can, but rather must make critical decisions about how to handle people’s sensitive data.

3.2. Defining users of privacy tools

Although less studied, recent work has started to examine other perspectives of users who are not data subjects. Agrawal et al. (Agrawal et al., 2021) argue that when considering privacy-preserving computation, which is typically far removed from the data subject or app user, the notion of “user” must be reconceptualized to include designers, developers, and policymakers who work closely with privacy-preserving computational tools. Based on interviews with these parties, the authors find that tools for privacy-preserving computation remain a mystery to many who interact with them. This poses pressing questions for governance and design of these tools. Our work argues that in the case of DP, the tool may similarly produce unexpected behavior from data practitioners (that can be detrimental to privacy and/or utility) unless norms of practice are conveyed both within and outside of the user interface, through direct engagement and context-aware educational guidance.

Nanayakkara et al. (Nanayakkara et al., 2022) also consider the perspective of research practitioners in their study. They interview participants and evaluate their comprehension as they use a visualization tool to explore the relationships between $\varepsilon$, accuracy, and disclosure risk, as well as the impact of DP noise on statistical inference. This work underscores the importance of tools for practitioners to understand the dynamics between privacy and accuracy and the implications of making parameter choices in a hands-on manner. Their tool is complementary to the prototype we use for this study, and our findings show that participants would benefit from being able to visualize these tradeoffs. However, as the authors point out, more work needs to be done to understand how generating visualizations cuts into the privacy-loss budget, and how to use the visualization to actually perform a DP analysis.

Concurrent work by Munilla Garrido et al. (Garrido et al., 2022) investigates the ‘academic-industrial DP utilization gap’ through interviews with data analysts and data stewards in major companies that have not yet deployed DP. Their findings about the barriers to adoption of DP, as well as the promises of using DP, support the findings in this paper. In particular, Munilla Garrido et al. conclude that DP can simplify onerous data access processes and facilitate data exploration across silos. They are optimistic about bridging the divide between theory and practice of DP, identifying key technical gaps that DP tools should address in order to promote adoption across industry. Echoing findings from a previous interview study by Dwork et al. (Dwork et al., 2019), Munilla Garrido et al. emphasize the need for sharing learnings across the DP community.

Finally, the work of Qin et al. (Qin et al., 2019) on multi-party computation (MPC), while not about DP, offers valuable insights on how to educate stakeholders about technical privacy guarantees “while [keeping practitioners] insulated from the nuances of [the protocols’] implementation.” Qin et al. find that describing simple examples of the MPC protocol to practitioners allows them to trust the protocol’s security on more complicated functions. In addition, modeling the user interface on something that is familiar, such as a spreadsheet, allows practitioners to participate more confidently. Finally, Qin et al. emphasize the importance of collaboration between security engineers and human factors experts in designing interfaces, particularly in settings (including MPC and DP) where users can interact with the interface only a limited number of times.

More generally, scholars have highlighted the ways in which computational tools shape the identity of users as well as the scope of their actions. Grint and Woolgar (Grint and Woolgar, 1997) describe this process in detail, highlighting that “by setting parameters for the user’s actions, the evolving machine effectively attempts to configure the user.” We build on these insights in this work to show how DP tools may seek, but fail, to reconfigure data analysts in the desired manner. We argue that reconfiguring users to make safe, ethical choices in the data analysis process requires not only better design within the tool, but also robust guidance, training, and education outside of the tool.

3.3. Ethical data science

Finally, there has been a slew of recent work on the challenges of translating ethical data science from theory to practice. Seda Gürses (Gürses and Del Alamo, 2016) draws attention to the “scant attention given to socialization of [privacy] tools” as an obstacle for bringing privacy-preserving methods from the lab into the world. Jörg Drechsler (Drechsler, 2022) highlights several ways in which differentially private data analysis clashes with fundamental workflows and modes of interaction within government statistical agencies. Oberski and Kreuter (Oberski and Kreuter, 2020) similarly delineate the tensions between DP and social science, such as conceptions of risk and workflow, but they also outline ways in which DP may lead to better social science by protecting against p-hacking and introducing robustness into the data analysis process. Barocas and boyd emphasize the tensions between ethicists who critique the ‘ethically uninformed’ practices of data science, and data scientists who feel that they already embed nuanced ethical considerations into every step of their analyses (Barocas and Boyd, 2017). Finally, boyd and Sarathy (boyd and Sarathy, 2022) describe the challenges faced by the Census Bureau in communicating about DP as it conflicts with dominant ‘statistical imaginaries’ about census data.

These works show that there are significant gaps between theory and practice of ethical data science which are crucial to explore further. Our work offers insights into barriers to adoption of differentially private data science, and discusses how we can pave the way towards a smoother uptake of DP in practice.

4. Methods

Our research approach consisted of both a think-aloud protocol and in-depth, semi-structured interviews with data practitioners. As DP is still considered an emerging technology, and there has not been much previous literature that empirically considers DP tools from the perspective of data practitioners, our research questions were open-ended and exploratory. At the same time, since our goal was to explore DP in practice, we decided to engage our participants using a technology probe (Hutchinson et al., 2003). We observed participants as they used DP Creator, the prototype of a differentially private data analysis tool described in Section 2.3.

4.1. Recruitment

Participants were recruited through a combination of convenience and purposive sampling (Etikan et al., 2016). We sought participants who were not experts in DP, but who were experienced in working with sensitive data. This included researchers from labs that had attempted to expand access to their own sensitive datasets, researchers who had experience applying for access to and conducting analyses on sensitive data, and data archive administrators who managed access to sensitive datasets. We relied on the network of the open-source software project OpenDP (Gaboardi et al., 2020) in order to identify organizations with use-cases of data that may be suitable for differentially private data analysis. In these organizations, there were typically one or two contacts who were well aware of and at least somewhat experienced with differential privacy. We asked these contacts to point us to other members of the organization who had little to no prior experience with differential privacy. (See Appendix A for an example of screening questions.) For example, one research group leader we reached out to through the OpenDP network responded that “I think [name of analyst in their research group] will be perfect - very data savvy, but without direct experience with DP.” Then, through open-ended exchanges with these potential participants, we asked them to tell us about their familiarity with differential privacy in order to confirm, using our own judgment, that the participants had no direct experience with or training in DP; for example, as one of our participants said, they “have heard of DP but don’t know what it means or how it works.”

We interviewed 19 participants in total: 9 academic researchers in social science disciplines and 10 curators of research data repositories across a few different US and international institutions (for example, Dataverse repositories (Crosas, 2011)). The summary of the total participant sample is shown in Table 1.

4.2. Data

We asked participants who were researchers (9 out of 19) to provide simulated or actual sensitive data that their research group had collected; out of these, 2 provided real data, 4 provided simulated data, and 3 did not provide data. The 2 participants who supplied real data were part of the same research group, which had collected this data as part of an online service deployed by the group. See Section 4.4 for more details on ethics considerations around the use of these data and measures taken to protect data subjects. The 4 participants who supplied simulated data were all part of a second, shared research group, and this synthetic data was created (using a statistical model applied to the real data) to have similar characteristics to their group’s real data. For the rest of the participants (3 researchers and 10 administrators), we uploaded into the prototype a 2010 American Community Survey (ACS) Public Use Microdata Sample (PUMS) dataset that contains demographic and income information about California residents. (Footnote 11: Such samples are publicly available at https://www.census.gov/programs-surveys/acs/microdata.html. Note that ACS PUMS files are already privacy-protected using disclosure avoidance techniques applied by the Census Bureau. However, for the purposes of our study, we treated this dataset as containing unprotected, sensitive information about individuals. In addition, note that ACS data are not weighted according to a simple random sampling design, but this was not taken into account within the prototype. To learn more about ACS design and methodology, see https://www.census.gov/programs-surveys/acs/methodology/design-and-methodology.html.)

We ensured that all of the datasets were suitable for use with differential privacy in terms of size, sensitivity, and format. The sizes of the datasets varied, but they were all large enough to provide good utility for differentially private statistics ($>5{,}000$ observations) yet small enough for the DP Creator prototype to handle efficiently ($<50{,}000$); we asked the dataset holders to provide a subset of the data if it was greater than this size. The datasets all contained demographic information about (possibly simulated) individuals that could be potentially sensitive such as age and sex, and therefore required some privacy protection, but they did not contain attributes that would be considered highly sensitive such as medical information. Finally, all of the datasets were formatted such that each row corresponded to one individual and all of their attributes; this allowed for a standard application of differential privacy without additional dataset transformations.
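
To give a rough sense of why dataset size matters for utility, the following sketch computes the scale of the noise added to a differentially private mean at the two size bounds mentioned above. It is only an illustration under stated assumptions: the Laplace mechanism, the age bounds of 0 and 100, a publicly known dataset size, and the function names are all hypothetical choices, not details of how DP Creator works. The value $\varepsilon=0.1$ is the one reported in Section 4.4.

```python
import math

def laplace_scale_for_mean(n, lower, upper, epsilon):
    """Scale of Laplace noise for an epsilon-DP mean of n values clamped to [lower, upper].

    Assumes n is public and neighboring datasets differ by changing one row,
    so the sensitivity of the mean is (upper - lower) / n.
    """
    sensitivity = (upper - lower) / n
    return sensitivity / epsilon

def error_bound(scale, alpha=0.05):
    """Bound t such that |noise| <= t with probability 1 - alpha for Laplace(0, scale)."""
    # For Laplace noise, P(|X| > t) = exp(-t / scale), so t = scale * ln(1 / alpha).
    return scale * math.log(1.0 / alpha)

# Hypothetical age bounds [0, 100]; epsilon = 0.1 as in Section 4.4.
for n in (5_000, 50_000):
    b = laplace_scale_for_mean(n, lower=0, upper=100, epsilon=0.1)
    print(f"n={n}: noise scale={b:.3f} years, 95% error bound={error_bound(b):.3f} years")
```

Under these assumptions the noise scale shrinks in proportion to $1/n$, which is one way to see why very small datasets tend to give poor utility at conservative privacy-loss parameters.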

4.3. Study Protocol

Our study protocol varied based on each participant’s background, role, and data used. Below, we describe our general study protocol and where we made these variations.

We began by asking participants about their skills, background, and experience. Based on this information, we assigned each participant a role as a data depositor or data analyst. For example, if a researcher had spent significant time working with the particular dataset provided by their group for the study, we assigned her to the role of data depositor. Otherwise, if the researcher was not very familiar with the dataset, we assigned her to the role of data analyst. These role assignments mirrored a real-life scenario, in which a data depositor would be very familiar with the dataset, while an analyst would not have had any prior access to the dataset but would be familiar with the data domain. (Footnote 12: For example, the data analyst might be familiar with working with ACS PUMS files regarding New York, but not California. Therefore, she might be familiar with how the data is coded and collected, but not with the relationships present in the data that may be specific to California.) As our screening questions (see Appendix A) specifically asked contacts to point us to members of their research groups with high and low experience with the group’s dataset, we were able to categorize researchers into the roles of depositor and analyst without much ambiguity. (Footnote 13: In smaller organizations, individuals may carry out the roles of both data depositor and data analyst, so there might not be much distinction between these roles. For the participants in our study, however, we found that there was only a one-directional overlap: the depositors nearly always participated in analysis, but the analysts were not always depositors. This allowed us to categorize participants accordingly.) For the 3 remaining researchers and 10 data archive administrators, we assigned the role of depositor or analyst based on which of the two roles the participant had more experience working in or with; if this was unclear, we chose a role that would balance out our numbers of depositors and analysts. Our total sample contained 9 depositors and 10 analysts.

Next, the participant was given a role-dependent task to carry out using DP Creator. If she was playing the role of a data depositor, we asked her to use DP Creator to release useful insights about the sensitive dataset to the public. If she was in the role of a data analyst, we asked her to use DP Creator to explore and discover new insights about the data. These tasks were designed to give the participant freedom while using the tool and to match the free-form environment that the participants are accustomed to when working with data (Tukey et al., 1977; Dasu and Johnson, 2003). If the participants’ research group had provided data for us to use, we knew that those participants were either familiar with the dataset itself and/or the data domain, and so we concluded that they would have a decent understanding as to which statistics they would want to release or explore. For participants using the California PUMS dataset, we gave them the following task (wording tailored for depositors):

Imagine that you are a researcher who has collected sensitive information about individuals in California, including attributes such as sex, race, marital status, and income level. Your dataset contains a simple random sample of 30,000 individuals from a population of 30 million individuals in California. Your goal is to use the DP Creator prototype to release privacy-preserving summary statistics that convey the main insights of your data. We’d like you to use the prototype to generate the following differentially private statistics:

(1) Mean age of individuals in California; missing values should be replaced with the number 30.

(2) OLS regression using $x$ = age and $y$ = sex as variables; missing values for age should be replaced with 30 and missing values for sex should be dropped.
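
As a rough illustration of what the first statistic in this task involves, the sketch below imputes missing ages with 30, clamps values to assumed bounds, and adds Laplace noise to the mean. This is a minimal sketch and not the DP Creator implementation: the function names, the bounds of 0 and 100, the toy data, and the choice of the Laplace mechanism are all assumptions made for illustration. Naive floating-point noise sampling like this is also not considered safe for real releases; a vetted DP library should be used in practice.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_mean(values, lower, upper, epsilon, impute_value, rng=None):
    """Illustrative epsilon-DP mean with fixed-value imputation (hypothetical helper)."""
    rng = rng or random.Random()
    # Replace missing entries (None) with a fixed, data-independent value,
    # so the number of rows is unchanged by imputation.
    filled = [impute_value if v is None else v for v in values]
    # Clamp every value to the analyst-supplied bounds to limit one row's influence.
    clamped = [min(max(v, lower), upper) for v in filled]
    n = len(clamped)
    # With n treated as public, changing one row moves the mean by at most (upper - lower) / n.
    scale = (upper - lower) / (n * epsilon)
    return sum(clamped) / n + laplace_noise(scale, rng)

# Toy data standing in for an age column with missing values (30,000 rows).
ages = [34, None, 51, 27, None, 62] * 5_000
release = dp_mean(ages, lower=0, upper=100, epsilon=0.1,
                  impute_value=30, rng=random.Random(0))
print(round(release, 2))
```

The bounds and the imputation value play the role of the metadata parameters that a depositor supplies before a release; choosing them by inspecting the sensitive data itself would weaken the privacy guarantee. The regression in the second statistic requires a more involved mechanism and is not sketched here.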

We used a think-aloud protocol (Lewis and Rieman, 1993) to observe participants as they were interacting with the prototype: we encouraged participants to vocalize their perceptions, questions, and challenges while using the prototype. Once participants completed the task, we asked them a series of semi-structured, open-ended reflection questions, which were adapted based on the particular session and participant. The questions were geared towards understanding participants’ thoughts on using DP to share and analyze sensitive data, the utility of the DP releases, and potential use-cases of DP for providing open access to datasets in a safe manner. (See Appendix B for sample interview questions.) The think-aloud protocol combined with the open-ended interview allowed us to explore both specific and high-level insights, many of which were relevant beyond the DP Creator prototype.

Interviews were recorded and transcribed; they lasted between 45 and 120 minutes, producing about 20 hours of audio and video recordings. The study team also took notes during the sessions. All parts of the study were approved by an IRB.

4.4. Ethical Considerations

Two out of the 19 participants in our study used real data from their shared research group as input to the DP Creator prototype. We obtained permission from the research group leaders and participants to use this data and made sure that the consent form for the data collection permitted the use of the data in our study. Most importantly, we carefully considered the ethics of using this data from the point of view of protecting data subjects in our study. Below, we describe the measures taken towards this goal.

Both participants used the same dataset, which was collected as part of an online service that their research team had deployed. The terms of service and privacy policy allowed for the sharing of personally identifiable information with other researchers for broad purposes, including improving the service for which the data was collected and for scientific research. (Footnote 14: In order to maintain the anonymity of our participants, we are not able to attach these terms of service as supplementary material. However, we describe the relevant portions here.) The terms also allowed for the sharing of aggregate statistics that are not personally identifying with third parties and the public. However, in considering the ethics of using this real data for our study, we took steps well beyond what the terms of service and privacy policy required.

In particular, we made sure that the sensitive data was never exposed in the clear to anyone outside of the aforementioned two participants who brought this data to the study session. First, the DP prototype was installed on each participant’s machine and the analyses were conducted locally, which ensured the dataset remained in sole possession of the participants. This avoided any cloud or network security concerns with storing information about the dataset. Second, our study team never viewed, had access to, or collected any information directly about the sensitive data. We only saw or recorded what was visible through the DP Creator interface, which included the size of the dataset, metadata parameters including the variable names and types, and the differentially private outputs, which we discuss more below.

We also made sure to limit anything the study team may have been able to glean about the sensitive data indirectly, which could have come from two sources: (1) the participant’s actions and think-aloud protocol, and (2) the differentially private aggregate statistics about the data that were produced by the DP Creator prototype. For (1), we reminded participants before starting that they should not discuss anything about the sensitive dataset that would reveal personally identifying information and/or information that was not already publicly available about the dataset. For example, the participant could talk to us about the accuracy of the differentially private analysis in qualitative terms, but they should not precisely indicate its relationship to the exact, non-DP statistic. If they had inadvertently done so, this information would have been removed from the recording, notes, and transcripts; however, this did not happen. For (2), we made sure that participants only used conservative privacy-loss parameters ($\varepsilon=0.1$, $\delta=10^{-6}$) before proceeding with the analysis, and as the participant was moving through the workflow of the prototype, we corrected any errors they had made that could compromise the privacy guarantee in order to ensure that the differentially private aggregate statistics robustly preserved the privacy of data subjects. Out of an abundance of caution, we also made sure that our recordings, notes, and transcripts did not include the value of this final output, even though it was already protected by DP.

Finally, for the simulated data provided by 4 participants from their shared research group, we took similar but slightly less stringent measures. This data was fully synthetic, and had already gone through robust disclosure avoidance procedures, but we still wanted to be careful in protecting the privacy of these (simulated) data subjects. Therefore, we again made sure to comply with the data use agreement, obtain permission from the research group, only view and record what was visible through the DP Creator interface, and limit what we may have learned about the dataset indirectly from our participants.

4.5. Data Analysis

The notes, transcripts, and audio/video recordings from the sessions were analyzed by two of the study team members using a reflexive thematic analysis approach adapted from Braun & Clarke (Braun and Clarke, 2021). One of the two team members had expertise in the foundations of DP; the other was familiar with the broad concepts of DP, and had expertise in designing and evaluating user interfaces. The team members each spent extended time familiarizing themselves with the materials from each interview, taking down notes on additional points of interest to direct the analysis and reflecting on their own positionality with regard to DP data analysis. Guided by the research questions and these additional notes, the team members open-coded the data to develop semantic and latent sets of codes reported in Appendix C. Examples of codes from the interviews included: difficult to check assumptions about data, worried about being held liable for exposing data, and misconception about how metadata parameters affect privacy. These two authors then clustered the codes into sets of themes for each research question. (Footnote 15: As RQ3 considers broader implications of DP, our themes for this question integrated higher-level codes with our review of the literature.) Over multiple discussions and iterations with the broader study team, these themes were refined into the final set of themes summarized in Table 5 and reported in Section 5.

5. Findings

Bibliography

Abowd et al. (2022) John Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Micah Heineck, Christine Heiss, Robert Johns, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, Brett Moran, William Sexton, Matthew Spence, and Pavel Zhuravlev. 2022. The 2020 Census Disclosure Avoidance System Top Down Algorithm. Harvard Data Science Review Special Issue 2 (June 24, 2022), 42–79. https://hdsr.mitpress.mit.edu/pub/7evz361i.
Abowd (2018) John M Abowd. 2018. The US Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, New York, NY, USA, 2867–2867.
Agrawal et al. (2021) Nitin Agrawal, Reuben Binns, Max Van Kleek, Kim Laine, and Nigel Shadbolt. 2021. Exploring design and governance challenges in the development of privacy-preserving computation. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–13.
Altman et al. (2021) Micah Altman, Aloni Cohen, Kobbi Nissim, and Alexandra Wood. 2021. What a hybrid legal-technical analysis teaches us about privacy regulation: The case of singling out. BUJ Sci. & Tech. L. 27 (2021), 1.
Balle et al. (2018) Borja Balle, Gilles Barthe, and Marco Gaboardi. 2018. Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in Neural Information Processing Systems 31 (2018), 1–9.
Barocas and Boyd (2017) Solon Barocas and Danah Boyd. 2017. Engaging the ethics of data science in practice. Commun. ACM 60, 11 (2017), 23–25.
boyd and Sarathy (2022) danah boyd and Jayshree Sarathy. 2022. Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau’s Use of Differential Privacy. Harvard Data Science Review Special Issue 2 (2022), 172–187.