Private Information Retrieval in Graph Based Replication Systems
Netanel Raviv, Itzhak Tamo, Eitan Yaakobi

TL;DR
This paper investigates private information retrieval protocols in graph-based storage systems, proposing a scheme that maximizes privacy against certain collusions and analyzing its efficiency and extensions.
Contribution
It introduces a 2-replication PIR scheme that guarantees privacy against acyclic collusions and provides bounds on its rate, extending to larger replication factors and coding.
Findings
Guarantees perfect privacy from acyclic sets
Achieves PIR rate within a factor of two of optimal for certain graphs
Extends results to larger replication factors and graph-based coding
Table I: example graphs (Petersen, complete bipartite, generalized polygons, Murty, Ramanujan) and the PIR rate of the resulting schemes.
Abstract
In a Private Information Retrieval (PIR) protocol, a user can download a file from a database without revealing the identity of the file to each individual server. A PIR protocol is called $t$-private if the identity of the file remains concealed even if $t$ of the servers collude. Graph based replication is a simple technique, which is prevalent in both theory and practice, for achieving erasure robustness in storage systems. In this technique each file is replicated on two or more storage servers, giving rise to a (hyper-)graph structure. In this paper we study private information retrieval protocols in graph based replication systems. The main interest of this work is maximizing the parameter $t$, and in particular, understanding the structure of the colluding sets which emerge in a given graph. Our main contribution is a $2$-replication scheme which guarantees perfect privacy from acyclic sets in the graph, and guarantees partial privacy in the presence of cycles. Furthermore, by providing an upper bound, it is shown that the PIR rate of this scheme is at most a factor of two from its optimal value for an important family of graphs. Lastly, we extend our results to larger replication factors and to graph-based coding, which is a similar technique with smaller storage overhead and larger PIR rate.
I Introduction
Recent data breaches in major corporations have emphasized the need for privacy in the digital era. Among the many challenges that designers of distributed storage systems face is the ability to support private information retrieval (PIR) protocols. These protocols enable the end user to retrieve an entry of the database, while concealing the identity of that entry from the servers. This paper studies PIR protocols in a particular common type of distributed storage systems.
Coding for storage systems has developed tremendously in recent years. However, many system designers still favor replication techniques over more involved ones as a means to guarantee robustness against hardware failures [12, 5]. In spite of having high storage overhead and low failure resilience, replication is often preferred due to its simplicity of implementation. In addition, various types of replication systems are studied in theoretical research due to their real-world impact and ease of analysis [18, 29, 30, 9, 19]. However, since contemporary datasets are far too large to be stored on one machine, it is usually the case that every machine stores a small number of selected files from the dataset, each of which is replicated among geographically separated machines. In turn, such systems can be modeled as hypergraphs, where nodes represent storage servers and (hyper-)edges represent files. In these graphs, an edge is incident with a node if a copy of the respective file is stored on the respective server. Storage systems which broadly adhere to the above outline are called graph-based replication systems. A graph-based replication system in which every file is replicated $r$ times is called an $r$-replication system, and $r$ is called its replication factor.
One of the most important metrics by which PIR protocols are measured is their collusion resistance. In its most simplistic form, a PIR protocol must guarantee perfect privacy against every individual server (in some settings, only computational privacy is required, but this paper focuses exclusively on perfect privacy). That is, it should be computationally impossible for every individual server to infer any information regarding the identity of the requested file. The term collusion resistance measures the ability of a PIR protocol to perform beyond this baseline. That is, what is the maximum number of servers that still remain completely oblivious to the identity of the file, even if collusion among them is permitted. Traditionally, the term "collusion" stems from a mindset which considers the servers themselves as adversaries. Yet, the authors of this paper deem this interpretation obsolete, since it does not align with contemporary storage services. Instead, one can think of geographically separated servers as having independent security protocols that must be individually broken by an adversary. In this case, the term "colluding servers" refers to a set of servers whose security was breached by an outside adversary, who can therefore observe their input and output. Normally, the term $t$-privacy of a given protocol indicates the maximum number $t$ of servers that cannot infer any information regarding the identity of the file even if they collude; and in our alternative viewpoint, $t+1$ is the minimum number of individually-secured servers that must be breached by an adversary in order to infringe the perfect privacy of the protocol. Nevertheless, in our choice of terms we comply with the standard nomenclature.
PIR protocols have been studied extensively in the past years, and many additional metrics of interest were defined. Among them are: (a) the PIR rate, which measures the ratio between the size of the desired data and the size of the downloaded data; (b) the upload complexity, which measures the size of the queries that are sent to the servers; and (c) the storage overhead, which measures the amount of redundancy in the system. While our main concern is understanding the collusion resistance of the system, we also address some of these metrics in our analysis.
In this paper we initiate a study about PIR protocols in graph based replication systems, and our primary focus is studying their collusion resistance. Since such systems are inherently non-uniform, in the sense that every server stores a different part of the dataset, one might expect that the collusion resistance will act accordingly. Indeed, our results show that the right viewpoint for analyzing colluding sets is not their size, but rather the structure of their induced subgraph. In particular, perfect privacy is maintained if the colluding sets do not contain certain sub-graphs.
Our results shed light on the design of such systems in a bilateral manner. On one hand, we provide recommendations for system designers regarding the file dispersion in the system. On the other hand, we provide a way for analyzing the collusion resistance of a given system. In particular, we provide a PIR protocol for $2$-replication systems and show that its PIR rate is at least half of its optimal value in many cases of interest. For larger replication factors we provide a simple scheme whose collusion resistance is one less than the replication factor, and another scheme which obtains a larger collusion resistance by a reduction to the $2$-replication case.
Further, we suggest an alternative graph-based coding approach, in which every file is coded by using an MDS code, and the resulting codeword symbols are dispersed as in graph-based replication systems. While this approach reduces the storage overhead and increases the PIR rate, it requires a careful file dispersion in order to guarantee high collusion resistance. The results in this paper, and graph-based coding in particular, call for future research and practical implementations, that would hopefully bring the vast PIR literature closer to realistic storage systems.
This paper is structured as follows. Preliminaries and previous works are discussed in Section II. Protocols and bounds for $2$-replication systems are given in Section III, and larger replication factors are discussed in Section IV. Then, graph-based coding is discussed in Section V, and open problems for future research are discussed in Section VI.
II Preliminaries
For a prime power $q$ let $\mathbb{F}_q$ be the field with $q$ elements. In a PIR protocol (not necessarily a graph-based one), a dataset $x$, which consists of $m$ files $x_1,\ldots,x_m$, is stored across $n$ storage servers in a possibly coded manner. The user wishes to download the file $x_i$, where for the sake of the probabilistic analysis, $i$ is seen as uniformly distributed over $[m]=\{1,\ldots,m\}$. To this end, the user uses randomness in order to generate queries $\mathbf{q}_1,\ldots,\mathbf{q}_n$, one for every server. In turn, server $j$ replies with an answer $a_j$ that is a deterministic function of $\mathbf{q}_j$ and the server's content. The protocol is called $t$-private if for every subset $T\subseteq[n]$ of size at most $t$,
$$I\big(\{\mathbf{q}_j\}_{j\in T}\,;\, i\big)=0,$$
where $I(\cdot\,;\cdot)$ denotes mutual information. Alternatively, the protocol is $t$-private if $\{\mathbf{q}_j\}_{j\in T}$ and $i$ are independent. Finally, the PIR rate of the system is $|x_i|/\sum_{j=1}^{n}|a_j|$, i.e., the ratio between the size of the desired data and the amount of downloaded data, both measured in symbols.
In a graph-based replication system every file is replicated multiple times and each one of the copies is stored on a different server. If all files are replicated an identical number of times $r$, we say that it is an $r$-replication system, and $r$ is its replication factor. In a $2$-replication system a graph structure arises, in which nodes represent servers, edges represent files, and an edge is incident with a node if the respective file is stored on the respective server. Similarly, in $r$-replication systems for $r\ge 3$ an $r$-uniform hypergraph (that is, a hypergraph in which all edges contain an identical number of nodes) structure arises, and in systems where every file is replicated a different number of times, a non-uniform hypergraph arises. Notice that for $r=2$, a multigraph (a graph in which a certain edge can appear multiple times; multiple occurrences of the same edge are called parallel edges) might arise, in cases where there exist two servers that share more than one file in common. While our analysis does not exclude these cases, they result in poor collusion resistance and obscure the overall message. Therefore, we restrict our attention to systems in which every two servers store at most one file in common (see Remark 7 for further discussion).
Graphs are denoted by $G=(V,E)$, where $|V|=n$ and $|E|=m$. Unless otherwise stated, all graphs in this paper are undirected, and hence, an edge is a subset of vertices (a subset of size two in ordinary graphs, and of arbitrary size in hypergraphs). For a given graph $G$ we denote its set of edges by $E(G)$ and its set of vertices by $V(G)$. Since graphs represent storage systems in this paper, the terms node, vertex, and server are used interchangeably, and so are the terms edge and file.
For a graph $G$ and a subset $T\subseteq V$ we denote by $G[T]$ the subgraph induced by $T$, i.e., the graph which consists of the nodes in $T$ and all the edges in $E$ whose incident nodes are all in $T$. A cycle in $G$ is a subgraph of $G$ whose nodes are $v_1,\ldots,v_k$ for some integer $k$, and whose edges are $\{v_1,v_2\},\{v_2,v_3\},\ldots,\{v_k,v_1\}$, and these edges exist also in $G$. An edge $e$ is said to be incident with a vertex $v$, and vice versa, if $v\in e$. The set of edges in $G$ that are incident with $v$ is denoted by $E_G(v)$, where $G$ is omitted if clear from context. The incidence matrix of a graph is a binary matrix in which rows correspond to nodes and columns correspond to edges, and an entry contains $1$ if and only if the respective vertex is incident with the respective edge. In the sequel, the well-known Breadth First Search (BFS) algorithm is used repeatedly, in graphs as well as in hypergraphs, and the unfamiliar reader is referred to [7].
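The graph notions above can be made concrete with a short sketch. It is only an illustration for ordinary (non-hyper) graphs; the function names and the toy graph are assumptions of the example and do not appear in the paper.

```python
def incidence_matrix(n, edges):
    """Rows are nodes 0..n-1, columns are edges; entry is 1 iff the node lies in the edge."""
    return [[1 if v in e else 0 for e in edges] for v in range(n)]

def induced_edges(edges, subset):
    """Edges of the subgraph induced by `subset`: all endpoints must lie in it."""
    s = set(subset)
    return [e for e in edges if set(e) <= s]

def is_acyclic(edges, subset):
    """Union-find cycle test on the subgraph induced by `subset`."""
    parent = {v: v for v in subset}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for u, v in induced_edges(edges, subset):
        ru, rv = find(u), find(v)
        if ru == rv:          # u and v are already connected, so this edge closes a cycle
            return False
        parent[ru] = rv
    return True

# Toy system: 4 servers, 4 files; a triangle 0-1-2 plus a pendant edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
M = incidence_matrix(4, edges)
```

Here `is_acyclic(edges, [0, 1, 3])` holds while `is_acyclic(edges, [0, 1, 2])` does not, which is the distinction that the privacy results of Section III are built on.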
In all subsequent protocols, the queries are vectors in $\mathbb{F}_q^m$, i.e., they contain a field element for every file. However, since the servers contain only a portion of the files in the system, the user communicates only their support to the servers. We denote by $Q$ the matrix whose $j$'th row is $\mathbf{q}_j$ for every $j\in[n]$, and note that it is a random variable that depends on $i$, and on the randomness at the user. In cases where $i$ is fixed, we denote the matrix of queries by $Q^{i}$.
Since submatrices are used repeatedly, we define the following notation. For a matrix and sets and , let be the submatrix of that consists of the rows in and the columns in . Further, let and . For vectors and we define and analogously. For convenience, we consider the rows and columns of a matrix as indexed by and , respectively, rather than by and . For example, if and , then is a matrix whose entries are indexed by . Since submatrices of are in strong correspondence with subgraphs of , for every subgraph of (denoted ) we denote , and similarly, for every vector we define .
By and large, we use lower-case letters to denote scalars, boldface letters to denote vectors (all of which are row vectors), capital letters to denote matrices or graphs, and calligraphic letters to denote sets. Finally, we use the standard notation $[n,k]_q$ to denote a linear code of length $n$ and dimension $k$ over $\mathbb{F}_q$.
II-A Previous work
Originally defined in [6], the PIR problem has attracted a tremendous amount of research in the past two decades; and due to its tight connection with distributed storage, PIR enjoyed an increasing attention in the past few years. Since a comprehensive summary of previous works is beyond the scope of this paper, we list herein only a partial list of recent contributions, and elaborate on the most relevant ones.
The recent surge of interest in PIR, which addresses the problem from a distributed storage standpoint, includes the reduction of storage overhead by using error correcting codes in [10] and its improvement in [3]; obtaining secrecy by one extra bit in [17] and its improvement in [4]; and an extensive line of works regarding achievability and capacity in various scenarios, such as multi-round, multi-message, symmetric, and with byzantine or colluding servers [20, 21, 23, 1, 26, 2, 22]. This line of works is a natural extension of an earlier one in the computer science community, which addressed the problem in a more simplistic fashion. Namely, the dataset is assumed to be replicated in its entirety on all servers in the system, and the files are assumed to consist of a single bit. Furthermore, this problem is strongly connected to locally decodable codes [27, 28], and has seen a substantial progress recently [8].
All of the aforementioned works fall into either one of two extremes in the approach towards PIR. In one, the dataset in its entirety is stored in every server, and in the other it is coded by using an MDS code. The current work addresses a sweet spot between the two, that is strongly motivated by real-world applications [12, 5], as well as a plethora of storage models that were addressed in the past [18, 9, 29, 30, 19].
Nevertheless, two notions that are relevant to this work were recently addressed in the literature. First, one may consider the special case of graph-based replication in which the degree of the nodes in the graph (the degree of a node is the number of edges that are incident with it) is upper bounded by some parameter. Evidently, this special case is strongly connected to a recent work [25], that addressed the general coded PIR question in cases where each server is constrained to contain only a fraction of the entire dataset. Yet, [25] did not impose the particular replication structure that is fundamental to our approach, and more importantly, did not consider collusion. Furthermore, we emphasize that our graph-based approach is highly flexible, in the sense that no constraint is imposed other than every file being replicated on a subset of the servers.
Another notion that was previously studied is that of collusion patterns [24, 13]. In this variant, the system must guarantee collusion resistance against specific subsets of servers, rather than any subset up to a certain size. This notion bears some similarity to this work, since one may compel the vertices in these specific sets not to induce a subgraph which infringes privacy in our scheme. However, the approach and the results of these works are substantially different from ours, e.g., since [24] only discusses coded storage, and [13] discusses replication of the entire dataset in every server, and disjoint colluding sets.
III Replication factor two
III-A A PIR protocol for 2-replication systems
In this section it is assumed that the replication factor is two, and that every two servers store at most one file in common (see Remark 7), which results in a graph . The scheme applies for any field with at least three elements. Upon requiring file , the user randomly chooses a vector , a vector , and an element , all uniformly at random, and defines
[TABLE]
where is obtained from by replacing the lower -entry in each column with , and then replacing the -entry in column by .
Let $\mathbf{q}_j$, the query for server $j$, be the $j$-th row of $Q$. Clearly, to upload this row we only need to send the values of its nonzero entries, and hence the total upload complexity is $2|E|$ field elements. Each node $j$ responds with the inner product $\mathbf{q}_j x^\top$, and therefore the download complexity is $|V|$, and the PIR rate is $1/|V|$. Note that node $j$ can calculate the inner product since the support of $\mathbf{q}_j$ contains only the indices of the files available to it. Upon receiving the information from all servers, the user has access to $Qx^\top$. Then, by multiplying from the left by the matrix and by the all-ones vector $\mathbf{1}$, the user gets
[TABLE]
and hence can be recovered. We proceed with studying the collusion resistance of the suggested scheme. The following claim is a special case of a more general one that is given in the sequel (Theorem 4). Nevertheless, it is given here in its current form to maintain simplicity and flow, and its proof is sketched.
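To make the retrieval step concrete, here is a minimal sketch of one possible instantiation over a prime field $\mathbb{F}_p$ with $p\ge 3$. It assumes queries of the form $Q=\mathrm{diag}(\mathbf{h})\,\hat{I}\,\mathrm{diag}(\boldsymbol{\gamma})$, where $\hat{I}$ is the signed incidence matrix with one entry of the requested column multiplied by an element $a\notin\{0,1\}$. The symbols `h`, `gamma`, `a` and this exact form are assumptions of the sketch, not the paper's stated definition, and the code checks only correctness of retrieval, not privacy.

```python
import random

def make_queries(n, edges, i, p, rng):
    """Assumed form: Q = diag(h) * I_hat * diag(gamma) over F_p (hypothetical names)."""
    m = len(edges)
    gamma = [rng.randrange(1, p) for _ in range(m)]   # nonzero column scalings
    h = [rng.randrange(1, p) for _ in range(n)]       # nonzero row scalings
    a = rng.randrange(2, p)                           # any element other than 0 and 1
    Q = [[0] * m for _ in range(n)]
    for e, (u, v) in enumerate(edges):
        top, bottom = min(u, v), max(u, v)
        Q[top][e] = h[top] * gamma[e] % p             # upper 1-entry of the incidence matrix
        Q[bottom][e] = -h[bottom] * gamma[e] % p      # lower 1-entry replaced by -1
    top = min(edges[i])
    Q[top][i] = Q[top][i] * a % p                     # mark the requested column with a
    return Q, (gamma, h, a)

def answer(row, files, p):
    """One field element per server; the row is zero outside the server's own files."""
    return sum(c * x for c, x in zip(row, files)) % p

def decode(answers, secret, i, p):
    gamma, h, a = secret
    # Undo the row scalings and sum: every column other than i cancels out.
    total = sum(ans * pow(hj, -1, p) for ans, hj in zip(answers, h)) % p
    return total * pow(gamma[i] * (a - 1), -1, p) % p

# Toy system: 4 servers, 5 files, every two servers share at most one file.
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
files = [5, 7, 11, 2, 9]
p = 13
```

With these definitions, calling `make_queries`, collecting `answer(row, files, p)` from each row, and passing the answers to `decode` returns `files[i]`. The user downloads one field element per server, in line with the rate $1/|V|$ stated above.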
Proposition 1.
For any set $T$ of servers such that the induced subgraph $G[T]$ does not contain a cycle, the queries seen by $T$ are independent of $i$.
Proof sketch.
To prove the claim, we analyze the submatrix of queries that is seen by $T$. For clarity, we omit zero columns from this matrix, as well as columns of weight one, since the latter are obviously purely random and cannot cause leakage of information. Hence, the matrix we analyze is the restriction of $Q$ to the rows in $T$ and to the columns of the edges of $G[T]$.
It is evident that every matrix which is chosen according to this random variable has support which is identical to that of the incidence matrix of $G[T]$. In what follows we explain why every matrix whose support is identical to that of the incidence matrix of $G[T]$ can be obtained by some choice of the user's randomness, and with identical probability, regardless of the value of $i$. Consequently, this proves that no information regarding $i$ is leaked.
We calculate by an iterative process that follows a Breadth First Search (BFS) traversal on $G[T]$. Pick an arbitrary , and fix the value of the corresponding (with probability one). Clearly, it follows that for every regardless of whether or not is the entry of which is multiplied by . Having the values of for every fixed, we have that for the same reasons, where is the other end of edge (again, regardless of whether or not is the entry of which is multiplied by ). In other words, we have that fixing an entry in which corresponds to some compels us to fix the values in which correspond to all of . In turn, fixing these entries of compels us to fix the values of at the other endpoints of the edges in . Since $G[T]$ does not contain a cycle, we may proceed in a BFS fashion and have that every edge-node incidence in $G[T]$ reduces the overall probability of obtaining by . Hence, every such matrix is obtained with probability , where is the size of the support of , and regardless of the value of $i$. Hence, perfect privacy is guaranteed. ∎
We now turn to study how gracefully the perfect privacy deteriorates if $G[T]$ contains one or more cycles, i.e., how much of the identity of $x_i$ is revealed.
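Proposition 2.
For any cycle $C$ in $G$, any matrix in the support of the random variable obtained by restricting $Q$ to the nodes and edges of $C$ is invertible if and only if the requested file $x_i$ is an edge of $C$.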
Proof.
Let , and observe that . If , then each column of has two nonzero entries and . Hence, is in its left kernel, and thus , where . Moreover, it is an easy exercise to show that any set of columns of are linearly independent, and hence .
On the other hand if , assume without loss of generality that is of the form
[TABLE]
where denotes a nonzero entry. Then, , where (resp. ) is the bottom-left (resp. top-left) submatrix of . Notice that is the product of all -entries in the sub-diagonal of , and that is the product of all -entries in the main diagonal of . Hence, since every pair of -entries in any given column are negations of one another, it follows that . Thus, . ∎
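Corollary 3.
A set $T$ such that $G[T]$ contains cycles can narrow down the possible values of $i$ (and hence, of $x_i$ itself) to
$$\Big(\bigcap_{j=1}^{a}E(C_j)\Big)\setminus\Big(\bigcup_{j=1}^{b}E(C'_j)\Big),$$
where $C_1,\ldots,C_a$ are all cycles in $G[T]$ that contain $x_i$ (for $a=0$ we formally define the empty intersection as $E$), and $C'_1,\ldots,C'_b$ are all cycles in $G[T]$ that do not contain $x_i$.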
Proof.
Let $M$ be the matrix that is seen by $T$, chosen according to the queries for the requested index $i$. By Proposition 2, the colluding servers can compute the rank of the submatrix of $M$ that corresponds to $C$ for every cycle $C$ in their induced subgraph, and deduce whether $x_i$ is an edge of $C$ accordingly. ∎
We now show that Corollary 3 is in some sense the best that the colluding servers can hope for. Formally, we show that conditioned on $i$ lying in the set given in Corollary 3, all respective possible queries are obtained with identical probability. The immediate conclusion is that out of the $\log m$ protected bits of $i$, the information leakage if a set $T$ colludes is precisely $\log m$ minus the logarithm of the size of the set given in Corollary 3; or, differently put, all files in that set are equally likely.
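The narrowing-down described in Corollary 3 amounts to simple set arithmetic once the colluders know, for each cycle in their induced subgraph, whether it contains the requested file. The following sketch illustrates this under that assumption; the function name and the toy data are hypothetical.

```python
def candidate_files(all_files, cycles_in_T, requested):
    """Files the colluders cannot rule out, using cycle-membership information only."""
    candidates = set(all_files)
    for cycle in cycles_in_T:
        if requested in cycle:
            candidates &= set(cycle)   # the requested file lies on this cycle
        else:
            candidates -= set(cycle)   # the requested file lies off this cycle
    return candidates

# Files are edge labels 0..6. The colluding servers induce two triangles that
# share file 2, which also creates a 4-cycle through files 0, 1, 3, 4.
files = range(7)
cycles = [{0, 1, 2}, {2, 3, 4}, {0, 1, 3, 4}]
```

In this toy system a request for file 2 is fully exposed, a request for file 0 is narrowed down to two candidates, and a request for a file outside the induced subgraph (5 or 6) is only known to lie outside it.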
To state the main theorem of this paper, whose proof is given in Appendix A, and of which Proposition 1 is a special case, we require the following definition. For and , we say that a matrix in is -compatible with (-compatible, for short) if its support coincides with that of . This definition extends naturally to a subgraph where a matrix in is said to be -compatible if it is -compatible.
Theorem 4.
For every subgraph $H\subseteq G$, the support of the random variable obtained by restricting $Q^{i}$ to $H$ is the set of all matrices $M$ such that:
(a) $M$ is $i$-compatible with $H$; and
(b) for every cycle $C$ in $H$,
[TABLE]
Furthermore, the random variable is uniformly distributed on its support.
First, it is evident that the case where is acyclic in Theorem 4 proves Proposition 1. Second, we have the following corollary.
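Corollary 5.
For every set $T$ and every two distinct values $i,i'\in[m]$ such that every cycle in $G[T]$ contains either both $x_i$ and $x_{i'}$ or neither of them, the servers in $T$ cannot infer whether the requested index is $i$ or $i'$.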
Proof.
Clearly, it suffices to prove that the query matrices seen by $T$ when the requested index is $i$ and when it is $i'$ are identically distributed, i.e., the same queries are obtained with identical probabilities. Since both random variables are uniformly distributed on their support by Theorem 4, it suffices to prove that their supports are identical. Also by Theorem 4, it suffices to prove that the conditions (a) and (b) coincide in both cases. For (a) this claim is clear since it does not depend on the requested index. For condition (b), we need to prove that $x_i$ is an edge of $C$ if and only if $x_{i'}$ is an edge of $C$ for every cycle $C$ in $G[T]$, which is precisely the assumption of the corollary. ∎
We now turn to present several choices of the graph , and the resulting privacy of the PIR schemes. These examples are summarized in Table I.
Example 6.
1. Taking $G$ to be the Petersen graph (a $3$-regular graph with $10$ nodes, $15$ edges, and girth $5$) allows storing $15$ files on $10$ servers, $3$ files on each, where any $4$ servers cannot infer any information regarding $i$. According to the structure of the Petersen graph, at least servers are required to infer the exact identity of $i$. The upload complexity is $30$ field elements, and the download complexity is $10$ field elements, i.e., the PIR rate is $1/10$.
2. Taking $G$ to be the complete bipartite graph $K_{\sqrt{m},\sqrt{m}}$, with $m$ a square integer and $n=2\sqrt{m}$, allows storing $m$ files on $n$ servers. To retrieve a file $x_i$, the user downloads $2\sqrt{m}$ field elements. The resulting system ensures perfect privacy against all sets $T$ such that either $|T\cap A|\le 1$ or $|T\cap B|\le 1$, where $A$ and $B$ are the two sides of the graph, and in particular, all sets of size three.
3. Graphs of large (constant) girth $g$ are particularly useful since all sets with at most $g-1$ nodes are cycle-free, and hence the resulting protocol is $(g-1)$-private. These can be obtained as incidence graphs of generalized polygons [18, Table I], of which Item 2 above is a special case. In particular, for prime power $q$, there exist explicit graphs with degree with (and hence ), where , respectively. The respective download complexities are , , and .
4. Let be a prime, and let be a positive integer. The Murty graph [16] is a -regular graph with nodes, edges, and girth five. In the resulting system, a database of files is stored on servers, files in each, and perfect privacy is ensured against any four colluding servers. To retrieve a file, a user downloads field elements.
5. Ramanujan graphs (e.g., [15]) with edges and constant degree have girth . Hence, the system is resilient against any colluding servers, but requires download of field elements for some .
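The Petersen graph item can be checked mechanically. The sketch below builds the graph with one standard labelling (outer pentagon, spokes, inner pentagram), computes its girth by BFS, and tests which server sets induce a cycle. The function names are illustrative, and the cycle test is a plain union-find pass rather than anything specific to the paper.

```python
from collections import deque
from itertools import combinations

def petersen_edges():
    outer = [(i, (i + 1) % 5) for i in range(5)]
    spokes = [(i, i + 5) for i in range(5)]
    inner = [(5 + i, 5 + (i + 2) % 5) for i in range(5)]
    return outer + spokes + inner

def girth(n, edges):
    """Shortest cycle length of a simple graph, via BFS from every node."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    best = float("inf")
    for root in range(n):
        dist, parent = {root: 0}, {root: None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w], parent[w] = dist[u] + 1, u
                    queue.append(w)
                elif parent[u] != w:       # non-tree edge closes a cycle
                    best = min(best, dist[u] + dist[w] + 1)
    return best

def has_cycle(edges, subset):
    """True iff the subgraph induced by `subset` contains a cycle."""
    s = set(subset)
    parent = {v: v for v in s}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for u, v in edges:
        if u in s and v in s:
            ru, rv = find(u), find(v)
            if ru == rv:
                return True
            parent[ru] = rv
    return False

E = petersen_edges()
g = girth(10, E)
private_4 = all(not has_cycle(E, T) for T in combinations(range(10), 4))
leaky_5 = sum(has_cycle(E, T) for T in combinations(range(10), 5))
```

Here `private_4` confirms that no set of four servers induces a cycle, while `leaky_5` counts the five-server sets that do, each of which is the vertex set of a pentagon.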
Remark 7.
It is evident that the correctness of the scheme and its privacy guarantees hold also in cases where there exist two servers that store more than one file in common. However, in the resulting multigraph, these two servers form a cycle, and hence can collude to infer some information regarding the identity of $x_i$. On the one hand, the system designer may choose to disperse the files while ignoring the aforementioned restriction in order to increase the number of files in the system, at the price of diminishing its privacy guarantees. On the other hand, if the system is designed such that every two servers store at most one file in common, it is clear that $t\ge 2$.
III-B Bound
In this subsection we explore the limitations of PIR protocols for graph-based replication systems by proving a bound on the PIR rate. The resulting bound is particularly powerful for the important family of regular graphs, for which the bound is within a factor of two from the rate in Subsection III-A. We prove the bound for two-replication systems that provide nontrivial privacy guarantees, namely, the system is at least two-private. In addition, the maximum degree of a vertex in $G$ is denoted by $\Delta$.
Lemma 8.
In every two-private two-replication system the PIR rate is at most $\Delta/|E|$.
Proof.
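Let $G=(V,E)$ be the induced graph, and for every server $v\in V$ let $\alpha_v$ be the amount of data which is downloaded from server $v$ by the user, measured as a fraction of the size of $x_i$. Clearly, it must be that $\alpha_u+\alpha_v\ge 1$ for every edge $\{u,v\}\in E$, since otherwise, servers $u$ and $v$ can infer that their mutual file is not required by the user, and hence the system is not two-private. Further, the PIR rate of the system is $1/(\mathbf{1}\boldsymbol{\alpha}^\top)$, where $\mathbf{1}$ is the all $1$'s vector of length $|V|$ and $\boldsymbol{\alpha}=(\alpha_v)_{v\in V}$. Hence, an upper bound on the PIR rate of the system is obtained from the optimal solution of the following linear program.
$$\min\ \sum_{v\in V}\alpha_v\quad\text{subject to}\quad \alpha_u+\alpha_v\ge 1\ \text{for every }\{u,v\}\in E,\qquad \alpha_v\ge 0\ \text{for every }v\in V.\tag{2}$$
That is, the inverse of the optimum value of the objective function serves as an upper bound on the PIR rate of the system. The following problem, which is called the dual of (2), has a vector $\boldsymbol{\beta}=(\beta_e)_{e\in E}$ of $|E|$ variables.
$$\max\ \sum_{e\in E}\beta_e\quad\text{subject to}\quad \sum_{e\ni v}\beta_e\le 1\ \text{for every }v\in V,\qquad \beta_e\ge 0\ \text{for every }e\in E.\tag{3}$$
According to the primal-dual theory [7, Sec. 29.4], the objective value of any solution which is feasible for (3) provides a lower bound for the optimum of (2). It is readily verified that $\beta_e=1/\Delta$ for every $e\in E$ is a feasible solution for (3), and the objective function for this solution equals $|E|/\Delta$. Therefore, the PIR rate is bounded by $\Delta/|E|$. ∎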
In cases where $G$ is a regular graph, which are particularly interesting since they induce systems with balanced storage, the resulting bound equals $2/|V|$, since $|E|=\Delta|V|/2$. However, the possibility of a considerable rate improvement in highly-unbalanced systems remains wide open.
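The bound can be evaluated with exact arithmetic. The sketch below assumes the duality argument sketched in the proof of Lemma 8: it assigns weight $1/\Delta$ to every edge, checks that the weights at each node sum to at most one, and returns the inverse of the total weight. The function name and the choice of $K_4$ as a test graph are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def rate_upper_bound(edges):
    """Upper bound on the PIR rate of a two-private 2-replication system."""
    degree = Counter(v for e in edges for v in e)
    delta = max(degree.values())
    beta = {e: Fraction(1, delta) for e in edges}     # candidate dual solution
    # Dual feasibility: the weights of the edges at any node sum to at most 1.
    for v in degree:
        assert sum(b for e, b in beta.items() if v in e) <= 1
    return 1 / sum(beta.values())                     # equals delta / |E|

# K_4 is 3-regular with 4 nodes and 6 edges.
k4 = list(combinations(range(4), 2))
bound = rate_upper_bound(k4)
scheme_rate = Fraction(1, 4)                          # one symbol per server
```

For this regular graph the bound is exactly twice the rate of one downloaded symbol per server, which is the factor-of-two gap discussed above.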
IV Arbitrary replication factors
In this section we consider $r$-replication systems for $r\ge 3$, which are favored in practice due to their greater resilience to simultaneous failures [12, 5]. First, for any integer $r$, collusion resistance of $r-1$ can be attained by a simple scheme that is given in Subsection IV-A. Then, we provide another scheme in Subsection IV-B, which guarantees larger collusion resistance by a reduction to the $2$-replication case. The collusion resistance in the latter case will strongly depend on our ability to increase the girth by removing edges from a certain multigraph. To simplify the discussion, in this section we drop the requirement that every two servers share at most one file in common.
IV-A Replication factor $r$ and collusion resistance $r-1$
The user begins by choosing a uniformly random matrix $A\in\mathbb{F}_q^{r\times m}$, whose rows sum to $\mathbf{e}_i$, the $i$'th unit vector of length $m$. Then, the user disperses the symbols of the matrix to the queries arbitrarily (this is possible since the queries contain $rm$ entries in total, one for every stored copy), such that every server that stores a file $x_j$ receives a unique entry from the $j$'th column of $A$. In turn, the servers respond with the respective linear combinations of their files, and the user computes the sum of all responses, which equals $x_i$.
It is readily verified that every set of $r-1$ servers can observe at most $r-1$ entries in every column of $A$, which appear entirely random, and hence the resulting scheme is $(r-1)$-private. Notice that there is no restriction on the number of files that can be stored in this system, nor is there a restriction on their dispersion.
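A minimal sketch of this simple scheme over a prime field $\mathbb{F}_p$ follows. The names `share_unit_vector`, `run_protocol`, and `placement` are hypothetical, and the assignment of the $k$-th copy of a file to the $k$-th row of the matrix is one arbitrary dispersion among many.

```python
import random

def share_unit_vector(m, r, i, p, rng):
    """r x m matrix over F_p, uniform subject to its rows summing to the i-th unit vector."""
    A = [[rng.randrange(p) for _ in range(m)] for _ in range(r - 1)]
    last = [((1 if j == i else 0) - sum(row[j] for row in A)) % p for j in range(m)]
    return A + [last]

def run_protocol(placement, files, i, p, rng):
    """placement[j] lists the r servers holding file j; returns the decoded file."""
    r, m = len(placement[0]), len(files)
    A = share_unit_vector(m, r, i, p, rng)
    answers = {}
    for j, servers in enumerate(placement):
        for k, s in enumerate(servers):               # k-th copy of file j gets A[k][j]
            answers[s] = (answers.get(s, 0) + A[k][j] * files[j]) % p
    return sum(answers.values()) % p                  # column j sums to 1 iff j == i

# Toy 3-replication system: 4 files on 4 servers.
placement = [(0, 1, 2), (1, 2, 3), (0, 2, 3), (0, 1, 3)]
files = [3, 8, 4, 10]
p = 11
```

Each server sends a single field element, and any $r-1$ servers see at most $r-1$ of the $r$ additive shares in each column, which is the source of the $(r-1)$-privacy claimed above.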
IV-B Arbitrary replication factor by reduction
In systems where files might be stored in more than two servers, one can obtain perfect privacy by "ignoring" all but two copies of every file that is replicated more than twice, in a sense that will be made clear shortly, and applying the scheme in Section III. Observe that choosing which copies to ignore may drastically affect the collusion resistance of the system, since each choice produces a different graph with different cycles. Nevertheless, this observation can in fact contribute to the security of the system by concealing the cycle structure of the resulting graph from an adversary. In what follows we formalize these intuitions and discuss the different aspects of the reduction to the 2-replication scheme.
Evidently, it is natural to consider an $r$-replication system for $r\ge 3$ (or in fact, any replication system) as a hypergraph, where each file corresponds to a hyperedge. Yet, for our purpose it is often more convenient to consider it as a colored multigraph. That is, instead of considering every file as a hyperedge, which is incident with the nodes that contain it, we consider a multigraph in which every edge carries a label (or a color) in $[m]$. Then, two servers are connected by an edge with label $j$ if both of them contain a copy of $x_j$. Clearly, given a hypergraph $H$, one can easily create the respective colored multigraph $G$ by replacing hyperedge $x_j$ with a clique whose edges are labelled by $j$. Notice that $G$ can be a multigraph (i.e., contain parallel edges) since hyperedges can intersect in more than one node. An illustration of these definitions is given in Figure 1, which also demonstrates the natural notions of monochromatic and polychromatic cycles, which will be useful in the sequel. In what follows we use $H$ and $G$ interchangeably.
Given a replication system with a respective multigraph $G$, it is obvious that the user can choose any two copies of every file, and apply the scheme from Section III while ignoring the remaining copies. Formally, for a server that stores a copy of $x_j$ that is chosen to be ignored by the user, the user simply transmits a zero coefficient for $x_j$, or omits that coefficient altogether. Further, the operation of ignoring all but two copies of every file corresponds to removing all but one of the edges of every color. Obviously, there are potentially many options to choose which edge to keep for every label, and every such choice can be described by a function $f:[m]\to E$ such that the edge $f(j)$ is labelled by $j$, for every $j\in[m]$. For any such $f$, let $G_f$ be the result of keeping the edges $f(1),\ldots,f(m)$, and removing the remaining ones. It is readily verified that the resulting scheme guarantees perfect privacy against colluding sets that do not contain a cycle in $G_f$.
Clearly, if one can choose the file dispersion in the system as one pleases, then it is possible to first choose the dispersion of only two copies of each file, so that the resulting graph has a certain girth. Then, the remaining copies can be dispersed arbitrarily, and the PIR scheme is performed with respect to the function that for every . However, if is given to the user, finding a function such that has a large girth requires more care.
For a given one can choose at random. In spite of not having any clear minimum girth guarantee, this approach has the extra benefit of concealing the cycle structure from an adversary. For a given integer , a function such that has girth , if one exists, can be found by deciding the feasibility of the following -program. In this program, for let be the set of all -subsets of such that there exists an edge labelled by .
- Objective: None.
- Variables: .
- Constraints:
  - for all .
  - for every such that there exists at least one edge in .
  - , for every that contain at least one triangle in .
  - for every that contain at least one -cycle in .
Clearly, the first set of constraints guarantees that exactly one edge is chosen for every file . The second set of constraints guarantees that the resulting choice does not contain -cycles, the next set guarantees that there are no triangles, and so on. Finally, we note that while solving this system for a general is NP-hard, the special case reduces to finding a maximum matching in a bipartite graph, a problem that can be solved efficiently.
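For very small systems, the feasibility question can also be explored by exhaustive search instead of a solver. The sketch below is not the program from the text: it simply enumerates every way of keeping one edge per file and computes the girth of each candidate graph with a breadth-first search from every vertex. The names `girth` and `find_choice` and the input layout are hypothetical, and the running time is exponential in the number of files, in line with the hardness remark above.

```python
from collections import deque
from itertools import combinations, product

def girth(edges):
    """Length of a shortest cycle in a multigraph given as (u, v) pairs.

    Returns float('inf') if the graph is acyclic.
    """
    pairs = [frozenset(e) for e in edges]
    if len(set(pairs)) < len(pairs):
        return 2  # parallel edges form a 2-cycle
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best = float("inf")
    for s in adj:  # BFS from every vertex; non-tree edges close cycles
        dist, parent = {s: 0}, {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w], parent[w] = dist[u] + 1, u
                    queue.append(w)
                elif parent[u] != w:
                    best = min(best, dist[u] + dist[w] + 1)
    return best

def find_choice(copies, g):
    """Return one kept edge per file so that the girth is at least g, or None."""
    labels = sorted(copies)
    options = [list(combinations(sorted(copies[l]), 2)) for l in labels]
    for pick in product(*options):
        if girth(pick) >= g:
            return dict(zip(labels, pick))
    return None

# Hypothetical 3-replication system with three files.
copies = {1: {"a", "b", "c"}, 2: {"a", "b", "c"}, 3: {"a", "b", "d"}}
choice = find_choice(copies, 4)
```

An acyclic choice has infinite girth here, so it satisfies any requested bound.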
V Graph-based coding: Reducing the storage overhead at improved PIR rates
This section discusses storage systems in which every file is similarly stored on a small number of servers, but replication is generalized to arbitrary encoding. Hence, when employing an code with rate larger than (i.e., ), we obtain an improvement over previous schemes in terms of storage overhead. Furthermore, it is shown that the resulting PIR rate is improved whenever . However, the (coded) file dispersion must follow a certain structure, and the resulting collusion patterns are in correspondence with polychromatic cycles (see Subsection IV-B and Figure 1), as will be explained next. Finally, we note that the scheme in this section is loosely inspired by ideas from [11] and [14].
Essentially, in the scheme of Section III, every file is coded by using a repetition code of length over the alphabet . Then, every symbol of the resulting codeword is stored on a different server. The scheme which is presented in this section generalizes this concept by employing codes other than the repetition code.
For integers and let be a generator matrix of an MDS code . Consider every file as an matrix over , and let , where the vectors are called the codeword symbols of . Let be disjoint nonempty subsets whose union is (and hence we must have ). Then, for every , disperse the codeword symbols to the servers such that for every , the codeword symbol is in exactly one server which belongs to . For example, one can think of a system in which the servers are partitioned into three disjoint subsets; the servers in the first subset contain the first halves of all files, the servers in the second contain the other halves, and the servers in the third contain the sums of the two halves (see Example 11 and Example 12 which follow).
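The three-subset example at the end of this paragraph corresponds to a length-three parity code: any two of the three stored symbols determine the file. A minimal sketch follows, assuming a small prime field of size `P = 7` and hypothetical helper names `encode` and `recover`; neither the field size nor the helpers come from the paper.

```python
P = 7  # hypothetical small prime; the field size is an assumption of this sketch

def encode(x1, x2):
    """Return the three codeword symbols (x1, x2, x1 + x2) of the parity code."""
    parity = [(a + b) % P for a, b in zip(x1, x2)]
    return [list(x1), list(x2), parity]

def recover(symbols):
    """Rebuild (x1, x2) from any two codeword symbols, given as {index: vector}."""
    if 0 in symbols and 1 in symbols:
        return symbols[0], symbols[1]
    parity = symbols[2]
    if 0 in symbols:
        x1 = symbols[0]
        return x1, [(c - a) % P for c, a in zip(parity, x1)]
    x2 = symbols[1]
    return [(c - b) % P for c, b in zip(parity, x2)], x2

# A file split into two halves; each symbol would go to a server in a different subset.
codeword = encode([1, 2], [3, 6])
halves = recover({0: codeword[0], 2: codeword[2]})  # first half and the sum suffice
```

Each file occupies three symbols of half the file size, which is where the storage saving over 2-replication comes from in this special case.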
The above coding scheme gives rise to an -uniform -partite hypergraph in the following manner. Let be the set of vertices, and define hyperedges , such that contains all servers that store either one of . It is evident that the edges are of size , and that the parts of the hypergraph are the sets . Let be this hypergraph, and let be its respective colored multigraph, as described in Subsection IV-B.
We begin by presenting the PIR protocol for the special case , and later extend it to other parameters by operating in rounds. Begin by choosing , and uniformly at random, and pick an arbitrary subset of size . Then, for every , a server which belongs to receives the following query.
[TABLE]
where is a Boolean indicator for the event " and ". Namely, the user transmits to server the part of the vector that is relevant to it, where arbitrary servers that store a codeword symbol of have the -th entry of multiplied by . In turn, a server in , which stores for some , responds with . Having the responses , the user composes the following matrix.
[TABLE]
where for , the -th column of e is
[TABLE]
Now, it is evident that every row in the matrix is a codeword in , whose minimum distance is . Therefore, since e has at most nonzero columns, and since , a decoding algorithm for can extract e from the matrix that was composed by the user. (Notice that the "error values" are in prescribed positions, and hence an erasure correction algorithm suffices.) At this point the user has obtained , that are sufficiently many codeword symbols of in order to retrieve it. Therefore, the PIR rate of this scheme is . The proof of privacy will be given after the general description.
Notice that in the above scheme, codeword symbols of are obtained, while many of those are sufficient to retrieve . However, in cases where , the scheme will not be successful, and in cases where , the resulting scheme will not be exploited to its full potential.
Therefore, to address cases in which , we retrieve multiple files in rounds, a standard practice in the PIR literature (e.g., [11, 14]). That is, we assume that the user wishes to download privately for some , and the protocol operates in rounds. In each round, the user sends a query to every server, and receives responses from all servers. Specifically, we choose and so that , i.e., and . Prior to executing these rounds, the user fixes the following subsets of
[TABLE]
such that in every row, the sets in the union are pairwise disjoint, such that for every , and such that for every . Intuitively, for and , the set contains the indices of the codeword symbols of that are retrieved during round . The choice of such sets is easy, and is illustrated in Appendix B.
In each round the user executes the aforementioned protocol (for the case ), where is used in lieu of the set . That is, the queries are defined as in (4), with the difference that is a Boolean indicator for the event "there exists such that and ". Having obtained the responses from all servers in round , the user computes
[TABLE]
where for , the -th column of is
[TABLE]
Since , a decoding algorithm on the matrix can extract the values of . Hence, according to the structures of the sets in (V), it follows that by the end of the -th round, the user has obtained the codeword symbols of for every , and hence all the files can be retrieved. The resulting PIR rate is
[TABLE]
Remark 9.
Roughly speaking, the scheme which is described in Section III is a special case of the one in this section, where , , and , and the resulting rate is indeed . However, further simplification is possible for this particular choice of , since the process of extracting the error vector e reduces to multiplying by from the left. Hence, the partitioning of the servers into subsets is not required.
Proposition 10.
A set that contains no polychromatic cycles in gains no information about .
Proof.
For that does not contain a polychromatic cycle, let be the set of hyperedges in that have two or more vertices in . Similar to Proposition 1, we analyze the matrix which is chosen according to the random variable . Clearly, every matrix which is chosen according to is -compatible with , and we show that the converse is also true.
Let be a matrix which is -compatible with . Fix some as the starting point of the BFS algorithm, and choose an arbitrary value for (with probability ). Once is fixed, it is evident that for every hyperedge that is incident with regardless of the value of the Boolean indicator . Notice that the only mutual element of these hyperedges is , since otherwise, a polychromatic cycle of length two would exist in . Therefore, once is fixed for such a hyperedge , we have that for every such that , again, regardless of . Proceeding in a BFS fashion, we have that each node-hyperedge incidence reduces the overall probability of obtaining by a multiplicative factor of . Since does not contain a polychromatic cycle, no discrepancy is encountered, which concludes the proof. ∎
Example 11.
Consider , and let be the parity code , and hence and . Also, let , , and . Consider the following hyperedges.
[TABLE]
It is readily verified that every two distinct edges intersect in at most one node, and hence, there are no polychromatic cycles of length . The resulting system is -private, has storage overhead , and its PIR rate is .
Example 12.
Generalizing the previous example, let be any integer divisible by , let be the parity code, and let , , and . Let be edge-disjoint maximum matchings in a complete bipartite graph whose one side is , and the other is . (Recall that a matching is a subset of disjoint edges. A maximal matching is a matching such that any edge that is added to it violates the disjointness of its edges. A maximum matching is a matching of the largest possible cardinality. It is readily verified that a complete bipartite graph contains disjoint maximum matchings.) Notice that for every , and consider the following hyperedges.
[TABLE]
We claim that any two of the above hyperedges intersect in at most one node. Assuming otherwise we have for some integers and . If , it follows that the edges and in share a vertex, even though they both belong to , a contradiction. If , it follows that the matchings and both contain the edge , another contradiction.
Therefore, the resulting system is -private, accommodates files, incurs storage overhead of , and has PIR rate of . For comparison, considering the full graph on nodes and applying the scheme in Section III provides a -private system with files, storage overhead , and comparable PIR rate .
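The intersection claim in Example 12 can be checked numerically for small parameters. The sketch below makes several assumptions that are not taken from the paper: the three server subsets are named `A`, `B`, and `C` and each has `m` servers, the edge-disjoint matchings are the cyclic shifts pairing `A_a` with `B_(a+i mod m)`, and the hyperedge built from an edge of the `i`-th matching is completed with the `i`-th server of `C`.

```python
from itertools import combinations

def example_hyperedges(m):
    """Hyperedges {A_a, B_b, C_i} for every edge (A_a, B_b) of the i-th matching."""
    edges = []
    for i in range(m):
        for a in range(m):
            b = (a + i) % m  # cyclic-shift perfect matching number i
            edges.append(frozenset({("A", a), ("B", b), ("C", i)}))
    return edges

def max_pairwise_intersection(edges):
    """Largest number of servers shared by two distinct hyperedges."""
    return max(len(e & f) for e, f in combinations(edges, 2))

edges = example_hyperedges(4)
print(len(edges), max_pairwise_intersection(edges))
```

With these choices, sharing two of the three coordinates forces the third to agree as well, which mirrors the two-case argument given above.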
VI Discussion and open questions
In this paper we initiated a study of private information retrieval for a specific storage model that is widely used in practice, and widely studied in theoretical research. In order to improve our understanding of this model, and in order to improve its applicability to real-world systems, we suggest the following research directions.
1. Close the gap between the achievable PIR rate in Subsection III-A and the upper bound in Subsection III-B.
2. Improve the collusion resilience in systems with arbitrary replication factors.
3. Construct families of dense graphs in which (1) is large for every and every .
4. Study graceful degradation for replication factors larger than two.
5. Find PIR schemes for -replication systems that guarantee collusion resistance against cycles, and are nontrivial (i.e., download less than the entire dataset).
Acknowledgments
The work of Itzhak Tamo was supported in part by Israel Science Foundation (ISF) Grant 1030/15 and NSF-BSF Grant 2015814. The work of Eitan Yaakobi was supported in part by Israel Science Foundation (ISF) grant 1817/18. The work of Netanel Raviv was supported in part by the postdoctoral fellowship of the Center for the Mathematics of Information (CMI) in the California Institute of Technology.
Appendix A Proof of the main theorem
The proof of Theorem 4 requires two auxiliary lemmas (Lemma 13 and Lemma 14); the theorem itself is then proved in two parts (Lemma 15 and Lemma 16).
Lemma 13.
Let be a cycle with edges, and let be a matrix which is -compatible, where is the maximum index of an edge in . Then, there exist precisely vectors such that is -compatible and .
Proof.
First, observe that since is a tree, and since is -compatible with , it follows that . Hence, the added vector a must be in , i.e.,
[TABLE]
where the 's are the columns of and the 's are coefficients from . Furthermore, since must be compatible with , the column a must contain nonzero entries precisely in row and row , that correspond to the two vertices incident with edge . Hence, since each row of contains precisely two nonzero entries in some columns and , it follows that intersecting the column span of with reduces the degrees of freedom in (6) by , since it renders any one of to be a linear function of the other. Therefore,
[TABLE]
Since any nonzero vector in is a suitable candidate for a, the claim follows. ∎
Lemma 14.
If an edge is on a cycle in , then there exists a BFS ordering of for which is a back edge.
Proof.
Denote and choose which maximizes , where the distance between two vertices is defined as the number of edges in the shortest path between them. Without loss of generality, assume that , and consider a BFS run which begins at . Partition into layers according to their distance from , and recall that edges inside each layer are always back edges. Hence, if is inside a layer, we are done. Otherwise, assume that is in for some , and hence is in . Since is on a cycle, there exists another edge from a node to . Hence, in cases where pops out of the queue before , will indeed be a back edge. It is readily verified that the order of insertion of discovered vertices in the same layer is arbitrary, and hence there exists a BFS run in which predates , and the claim follows. ∎
We now turn to prove Theorem 4 in two parts.
Lemma 15.
For every subgraph , the support of the random variable is the set of all matrices such that:
- (a) is -compatible with ; and
- (b) for every cycle
[TABLE]
Proof.
For simplicity assume that , but other cases can be proved similarly. By the definition of , it is evident that (a) is necessary, and according to Proposition 2, it follows that (b) is necessary. In what follows, it is shown that (a) and (b) are also sufficient. To this end, let be a matrix which satisfies (a) and (b), and it is shown that there exists a choice of and for which produces .
Consider a BFS run on , and number and according to their discovery times. That is, let be the vertices of sorted by their discovery times, and let be the edges of sorted by their discovery times. Also, assume that if , and closes a cycle, then it is a back edge (see Lemma 14). The values of , and which produce are determined according to this BFS ordering, as follows.
First, fix an arbitrary value in for . Then, since is incident with the edges , we fix the values of as . Then, for , that are the end vertices of , respectively, we fix . If is not on a cycle in , and happens to be, say, , then we can obviously choose , where is arbitrary (the case where lies on a cycle is treated in the sequel). Clearly, this process goes on unhindered as long as a back edge is not discovered.
Once a back edge is discovered, we have that were already determined in earlier stages of the algorithm. Hence, we ought to show that there exists for which
[TABLE]
To this end, let be a cycle which is discovered in whole when is discovered and let be its number of edges. Further, let , i.e., the partial matrix of which corresponds to the subgraph . Similarly, let be the matrix which corresponds to the choice of entries in and up until is discovered. By the correctness of the algorithm so far, it follows that . Moreover, both and are -compatible, and by the definition of , the submatrix is -compatible, and its rank is . According to Lemma 13 there exist precisely columns that extend (and also ) to a -compatible matrix of rank , one of which is . Further, it is evident that the matrix , for any of the possible values of , results in a -compatible matrix of rank as well. Therefore, there exists a 1-1 correspondence between the possible values of and . Since one of is the actual -th column of , it follows that there exists a unique value of which satisfies (7).
If lies on a cycle in , we denote . Since is a back edge, we have that and were determined in earlier steps of the algorithm. Hence, we must find and for which
[TABLE]
Clearly, the choice satisfies (9), and consequently, satisfies (8). We are only left to show that this value for is neither [math] nor . First, it is obviously nonzero as a product of nonzero terms. Second, if happens to be the answer, we have by Proposition 2 that is rank-deficient, in contradiction with condition (b). ∎
Lemma 16.
For every , the random variable is uniformly distributed on its support.
Proof.
Let be a matrix in the support of . By following the proof of Lemma 15, we have that once is fixed, and as long as a back edge is not discovered, every edge-node incidence reduces the overall probability of obtaining by . In addition, every back edge which is not reduces the probability of obtaining by due to (7), instead of by for tree edges (an edge which is not a back edge in a BFS ordering is called a tree edge). Finally, if lies on a cycle, it reduces the overall probability by due to (9) and by due to (8). Therefore, we have the following, where denotes the number of edge-node incidences in , and denotes the number of back edges in a BFS run (which is identical in every run of a BFS algorithm).
- If is not on a cycle in then .
- If is on a cycle in then . ∎
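The remark that the number of back edges is the same in every BFS run can be illustrated on a toy graph. The sketch below classifies edges into tree edges and back edges, using the convention from the proof that every non-tree edge is a back edge; the function name `bfs_edge_types` and the four-vertex graph are hypothetical. For a connected simple graph the count equals the number of edges minus the number of vertices plus one, whichever root is used.

```python
from collections import deque

def bfs_edge_types(adj, root):
    """Classify the edges of a connected simple graph for one BFS run.

    adj: dict mapping a vertex to the list of its neighbours.
    Returns (tree_edges, back_edges) as sets of frozensets.
    """
    seen = {root}
    tree, back = set(), set()
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            e = frozenset((u, w))
            if w not in seen:
                seen.add(w)
                tree.add(e)      # first discovery of w: a tree edge
                queue.append(w)
            elif e not in tree:
                back.add(e)      # both endpoints already discovered
    return tree, back

# A triangle a-b-c with a pendant vertex d attached to c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
counts = {root: len(bfs_edge_types(adj, root)[1]) for root in adj}
```

Which edge of the triangle ends up as the back edge depends on the root and on the order of discovery, in the spirit of Lemma 14, while the count stays fixed.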
Appendix B Choice of sets
The process of choosing the sets in (V) is very simple, and is best illustrated by the following examples.
Example 17.
Assume that and , which implies that and . Consider the following matrix
[TABLE]
which naturally corresponds to the sets
[TABLE]
As another example, in which , we may consider the following.
Example 18.
Assume that and , which implies that and . Consider the following matrix
[TABLE]
which naturally corresponds to the sets
[TABLE]
References
- [1] K. Banawan and S. Ulukus, "The capacity of private information retrieval from coded databases," arXiv:1609.08138 [cs.IT], 2016.
- [2] K. Banawan and S. Ulukus, "Multi-message private information retrieval: Capacity results and near-optimal schemes," IEEE Transactions on Information Theory, 2018.
- [3] S. Blackburn and T. Etzion, "PIR array codes with optimal PIR rate," arXiv:1607.00235 [cs.IT], 2016.
- [4] S. Blackburn, T. Etzion, and M. B. Paterson, "PIR schemes with small download complexity and low storage requirements," arXiv:1609.07027 [cs.IT], 2016.
- [5] Apache Cassandra™ 2.1 for DSE, Data replication, https://docs.datastax.com/en/cassandra/2.1/cassandra/architecture/architectureDataDistributeReplication_c.html.
- [6] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, "Private information retrieval," IEEE 36th Annual Symposium on Foundations of Computer Science (FOCS), pp. 41–50, 1995.
- [7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, 2009.
- [8] Z. Dvir and S. Gopi, "2 server PIR with sub-polynomial communication," Forty-Seventh Annual ACM on Symposium on Theory of Computing (STOC), pp. 577–584, 2015.
