Private Information Retrieval from Non-Replicated Databases
Karim Banawan, Sennur Ulukus

TL;DR
This paper investigates private information retrieval from non-replicated databases with specific graph-structured storage, deriving capacity bounds and proposing optimal schemes for cyclic and fully-connected configurations.
Contribution
It introduces capacity bounds and capacity-achieving schemes for PIR in non-replicated, graph-structured databases, extending beyond traditional replicated models.
Findings
PIR capacity for cyclic graphs is 2/(K+1).
PIR capacity for fully-connected graphs is min{2/K, 1/2}.
Non-replication causes significant capacity degradation.
This work was supported by NSF Grants CNS 15-26608, CCF 17-13977 and ECCS 18-07348.
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742
Abstract
We consider the problem of private information retrieval (PIR) of a single message out of $K$ messages from $N$ non-colluding and non-replicated databases. Different from the majority of the existing literature, which considers the case of replicated databases where all databases store the same content in the form of all $K$ messages, here, we consider the case of non-replicated databases under a special non-replication structure where each database stores $M$ out of $K$ messages and each message is stored across $R$ different databases. This generates an $R$-regular graph structure for the storage system where the vertices of the graph are the messages and the edges are the databases. We derive a general upper bound for $M=2$ that depends on the graph structure. We then specialize the problem to storage systems described by two special types of graph structures: cyclic graphs and fully-connected graphs. We prove that the PIR capacity for the case of cyclic graphs is $\frac{2}{K+1}$, and the PIR capacity for the case of fully-connected graphs is $\min\{\frac{2}{K}, \frac{1}{2}\}$. To that end, we propose novel achievable schemes for both graph structures that are capacity-achieving. The central insight in both schemes is to introduce dependency in the queries submitted to databases that do not contain the desired message, such that the requests can be compressed. In both cases, the results show severe degradation in PIR capacity due to non-replication.
1 Introduction
Private information retrieval (PIR), introduced in [1], is a canonical problem to study the privacy of users as they download content from public databases. In the classical setting, a user is interested in retrieving a single message (file) out of $K$ messages from $N$ replicated and non-colluding databases, in such a way that no database can know the identity of the user's desired file. The PIR problem has become a vibrant research topic within information theory starting with trailblazing papers [2, 3, 4, 5, 6, 7, 8]. In [9], Sun and Jafar introduce the PIR capacity, which is the supremum of the ratio of the number of bits of desired information ($L$) that can be retrieved privately to the total downloaded information. They characterize the PIR capacity of the classical PIR problem to be $C = \left(1 + \frac{1}{N} + \cdots + \frac{1}{N^{K-1}}\right)^{-1}$. The fundamental limits of many interesting variants of the problem have been investigated in [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56].
A common assumption in most of these works is that the entire message set is replicated across all databases. This is crucial for constructing capacity-achieving schemes, as in many existing schemes the undesired symbols downloaded from one database are exploited as side information in the remaining databases, and replication is the key that enables downloading any bit from any database and using it as side information at any other database. However, the replication assumption may not be practical in next-generation storage systems and networks. From a storage point of view, message replication is impractical as it incurs high storage cost, especially for storage systems with a large number of messages or files with a large size. From a network structure point of view, in next-generation networks where peer-to-peer (P2P) connections will be prevalent, nodes (i.e., databases) may not necessarily possess the same set of messages. These practical scenarios, which challenge the replication assumption, motivate investigating PIR in non-replicated storage systems. In this work, we aim at devising achievable schemes that do not rely on message replication, and at the same time, that are more efficient than the trivial scheme of downloading the contents of all databases. We aim at evaluating the loss in the PIR rate due to non-replication and investigating the interplay between the storage structure and the resulting PIR rate.
A few works have considered relaxing the replication assumption. Reference [14] investigates the case when the contents of the databases are encoded via an MDS code instead of assuming data replication, and derives the PIR capacity for this setting, which reveals a fundamental tradeoff between storage cost and retrieval cost. Reference [40] studies the PIR problem from storage-constrained databases. In this problem, each database is constrained to store only a limited number of uncoded bits (as opposed to the $KL$ bits needed in replicated databases). [40] shows that symmetric batch caching, which was originally introduced for centralized coded caching systems in [57], results in the largest possible PIR rate under storage constraints. This problem is extended to the decentralized setting in [54], where each database stores its bits randomly and independently from any other database. [54] shows that uniform and random bit selection, which was introduced for decentralized coded caching systems in [58], results in the largest possible PIR rate under storage constraints.
The work that is most closely related to our work here is [55]. The databases in [55] store different subsets of the message set. Different from the previous works on non-replication such as [40, 54], in [55] databases store full messages and not portions of every message. In particular, [55] investigates the case when every message is replicated across two databases only. The storage system in this case can be represented by a graph, in which every two databases are connected via an edge corresponding to the common message. [55] proposes an achievable PIR scheme that is immune against colluding databases that do not form a cycle in the graph. The scheme in [55] achieves a retrieval rate of $\frac{1}{N}$. The work in [55] highlights some interesting insights about the relation between some combinatorial properties of the graph and the immunity against database collusion. In the extended version of [55] in [56], which has appeared concurrently and independently of our work here, an upper bound is proposed to show that their PIR rate is at most a factor of 2 from the optimal value for regular graphs, and the techniques are extended to larger replication factors.
In this paper, we consider PIR of a single message out of $K$ messages from $N$ non-replicated and non-colluding databases. In our formulation, each message appears in $R$ different databases, and every database stores $M$ different messages. Thus, the storage system is parameterized by $(K, R, M, N)$ such that $MN = KR$, where $K$ is the total number of messages in the system, $R$ is the replication factor of each message, $M$ is the storage constraint of each database, and $N$ is the number of databases. We focus on the case $M = 2$. For this case, the storage system can be uniquely specified by an $R$-regular graph. In our graph formulation, the messages correspond to the vertices and the databases correspond to the edges. This is in contrast to [55], where $R = 2$, and the roles of messages and databases are reversed on the graph. Hence, our graph formulation may be considered as the dual graph formulation to [55]. Our goal is to characterize the PIR capacity of this system.
First, we derive a general upper bound on the retrieval rate for storage systems described by $R$-regular graphs. Interestingly, the upper bound depends on the structure of the graph and not only on $(K, R, M, N)$. In particular, the upper bound is related to the longest sequence of databases that cover all of the messages in the storage system. We specialize the problem further to two classes of graphs, namely, cyclic graphs and fully-connected graphs, where we obtain exact results. In cyclic graphs, all vertices form a circle connected by edges. Therefore, two edges (two databases) emanate from each vertex (a message), which means that each message is common to two adjacent databases, which are arranged in a cycle. Thus, in this case $R = 2$, and since $M = 2$ in this paper, using $MN = KR$ mentioned above, we have $N = K$. For this type of graph, we show that $C_{\text{PIR}} = \frac{2}{K+1}$. The achievable scheme starts from the greedy algorithm of Sun and Jafar [9] and then compresses the requests to databases by replacing the individual symbols of the scheme in [9] with a sum of symbols from two messages. This compression necessitates exploiting side information even in databases that do not contain the desired message. In fully-connected graphs, each vertex is connected to all of the remaining vertices. Therefore, $K-1$ edges ($K-1$ databases) emanate from each vertex (a message), which means that each message resides in $K-1$ databases. Thus, in this case $R = K-1$, and since $M = 2$, from $MN = KR$, we have $2N = K(K-1)$, i.e., $N = \binom{K}{2}$. That is, every combination of two messages appears in a different database. In this case, we show that $C_{\text{PIR}} = \min\{\frac{2}{K}, \frac{1}{2}\}$. For $K \geq 4$ in this case, we propose a novel achievable scheme, which is based on retrieving a single weighted sum (over a sufficiently large field) of two symbols from every database. For the comparable cases with [55], our scheme outperforms their scheme in terms of the PIR rate. We note that, in both cyclic and fully-connected graph cases, the PIR capacity converges to zero as $K \to \infty$, which implies a severe degradation in the PIR efficiency due to non-replication.
Finally, we show an example for a storage system with $M = 3$. We provide a novel achievable scheme that uses processed side information and outperforms the scheme in [55].
2 Problem Formulation
Consider the problem of PIR from $N$ non-replicated and non-colluding databases. We denote the databases by $D_1, \dots, D_N$. The storage system stores $K$ messages in total, each message is stored across $R$ different databases, i.e., $R$ is the repetition factor for every message, and each database stores locally $M$ different messages. We denote the message set by $\{W_1, \dots, W_K\}$. Each message $W_k$ is a vector of length $L$ picked in an i.i.d. fashion from a sufficiently large finite field $\mathbb{F}_q$, so that (measuring entropy in $q$-ary symbols)
$$\begin{aligned} H(W_k) &= L, \quad k = 1, \dots, K, \\ H(W_1, \dots, W_K) &= KL. \end{aligned}$$
The storage system is parameterized by $(K, R, M, N)$. We note that for a feasible storage system (that is symmetric across databases and messages), we have $MN = KR$. In this work, we focus on the case $M = 2$. To fully characterize the storage system in this case, we represent the storage system as an $R$-regular graph; see Fig. 1 and Table 1 for an example. We characterize the storage system by a regular graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of vertices and $\mathcal{E}$ is the set of edges, i.e., in this graph, the vertices are the messages and the edges are the databases. An edge drawn between messages $W_i$ and $W_j$ means that the contents of the corresponding database $D_n$ is $Z_n = \{W_i, W_j\}$. This graph is an $R$-regular graph, since each message is repeated $R$ times across the storage system. In the following, we define specific parameters of the graph, which are needed while constructing the converse proof. (Footnote 1: We note that the graph used in our formulation may be considered as the dual graph of the one used in [55]. In our work, $M = 2$, the nodes are the messages and the edges are the databases, while in [55], $R = 2$, the nodes are the databases and the edges are the messages.)
Definition 1 (Graph reduction)
The graph is reduced iteratively starting with the vertex $W_1$ by enumerating all the edges connected to $W_1$, and removing all neighboring vertices reached through the enumerated edges except one. The retained neighbor becomes the starting vertex of the next reduction. The process of enumerating edges and removing the corresponding neighbors continues iteratively until one vertex is left.
Definition 2 (Spread of the graph)
The spread of a graph, denoted by $\delta$, is the length of the longest sequence of edges (databases) that results from the graph reduction procedure given in Definition 1.
An example graph reduction for the storage system given in Fig. 1 and Table 1 is shown in Fig. 2. In this work, we further focus on two special classes of $R$-regular graphs, namely: cyclic graphs and fully-connected graphs.
Definition 3 (Cyclic graphs)
The graph is called a cyclic graph if each two adjacent vertices are connected by an edge and no non-adjacent vertices are connected by an edge, i.e., the contents of the databases can be written as (without loss of generality):
$$\begin{aligned} Z_n &= \{W_n, W_{n+1}\}, \quad n = 1, \dots, N-1, \\ Z_N &= \{W_N, W_1\}. \end{aligned}$$
Consequently, the cyclic graph is parameterized by $(K, R, M, N) = (K, 2, 2, K)$, and the spread of the graph is $\delta = K-1$.
Definition 4 (Fully-connected graphs)
The graph is called fully-connected if every two vertices are connected by a unique edge. Hence, the contents of the databases can be written as the subsets of $\{W_1, \dots, W_K\}$ with 2 elements. The fully-connected graph is parameterized by $(K, R, M, N) = (K, K-1, 2, \binom{K}{2})$, and the spread of the graph is $\delta = K-1$.
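As a sanity check on Definitions 3 and 4, the following minimal sketch builds the two storage layouts and recovers the parameters $(K, R, M, N)$ from them. The function names and the 0-based message labels are illustrative assumptions, not notation from the paper.

```python
from itertools import combinations

def cyclic_storage(K):
    """Database n stores messages n and n+1 (mod K); 0-based labels are an assumption."""
    return [(n, (n + 1) % K) for n in range(K)]

def fully_connected_storage(K):
    """One database for every unordered pair of messages."""
    return list(combinations(range(K), 2))

def parameters(K, databases):
    """Return (K, R, M, N) and check regularity and the identity M*N == K*R."""
    N = len(databases)
    M = len(databases[0])
    # Degree of a message = number of databases (edges) that store it.
    degrees = [sum(k in db for db in databases) for k in range(K)]
    assert len(set(degrees)) == 1, "storage graph is not regular"
    R = degrees[0]
    assert M * N == K * R
    return (K, R, M, N)

print(parameters(5, cyclic_storage(5)))           # (5, 2, 2, 5)
print(parameters(5, fully_connected_storage(5)))  # (5, 4, 2, 10)
```

The cyclic layout is meaningful for $K \geq 3$; for $K = 2$ the two edges coincide.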
In PIR, the user wants to retrieve a message $W_k$ without leaking any information about the identity of the message to any individual database. To that end, the user sends $N$ queries $Q_1^{[k]}, \dots, Q_N^{[k]}$, one for each database. These queries are independent of the messages as the user has no information about the messages prior to retrieval, hence,
$$I(W_1, \dots, W_K; Q_1^{[k]}, \dots, Q_N^{[k]}) = 0.$$
The databases respond to the user queries by answer strings $A_1^{[k]}, \dots, A_N^{[k]}$. The answer string $A_n^{[k]}$ is a deterministic function of the query $Q_n^{[k]}$ and the contents of the database $D_n$, which is denoted by $Z_n$, therefore,
$$H(A_n^{[k]} \mid Q_n^{[k]}, Z_n) = 0.$$
In PIR, we have two formal requirements. First, we have the privacy requirement. To ensure privacy, the retrieval strategy intended to retrieve $W_i$ must be indistinguishable from the retrieval strategy intended to retrieve $W_j$ for any $i$ and $j$, i.e., for every database $n$,
$$(Q_n^{[i]}, A_n^{[i]}, W_1, \dots, W_K) \sim (Q_n^{[j]}, A_n^{[j]}, W_1, \dots, W_K), \tag{6}$$
where $\sim$ denotes statistical equivalence.
The second requirement is the reliability requirement. The user needs to be able to reconstruct $W_k$ perfectly from the collected answers, i.e.,
$$H(W_k \mid Q_1^{[k]}, \dots, Q_N^{[k]}, A_1^{[k]}, \dots, A_N^{[k]}) = 0. \tag{7}$$
(Footnote 2: The results of this work do not change if the reliability constraint is relaxed to allow an arbitrarily small probability of error.)
We measure the efficiency of the retrieval scheme by the retrieval rate $R_{\text{PIR}}$. An achievable retrieval scheme is a scheme that satisfies (6), (7) for some message length $L$. The retrieval rate is the ratio between the length of the desired message and the total download,
$$R_{\text{PIR}} = \frac{L}{\sum_{n=1}^{N} H(A_n^{[k]})}.$$
The PIR capacity is the largest PIR rate over all achievable schemes, i.e., $C_{\text{PIR}} = \sup R_{\text{PIR}}$.
3 Main Results
In this section, we present the main results of this paper. Our first result is a general upper bound for storage systems defined by $R$-regular graphs with $M = 2$ and arbitrary $K$, $R$, and $N$, which is given in the following theorem. The proof of Theorem 1 is given in Section 4.
Theorem 1 (Upper bound for $R$-regular graphs)
For an $R$-regular graph storage system with $M = 2$, the retrieval rate is upper bounded by
$$R_{\text{PIR}} \leq \min\left\{\frac{R}{N}, \frac{1}{1 + \frac{\delta}{R}}\right\},$$
where $\delta$ is the spread of the graph (Definition 2).
Remark 1
The upper bound reveals a dependency on the structure of the storage system, captured in the spread of the graph $\delta$. That is, the upper bound cannot be parameterized by $(K, R, M, N)$ only. This opens the door for joint optimization of the storage system together with the retrieval scheme.
Remark 2
The upper bound is a general upper bound which is valid for any storage system with $M = 2$ that is represented via an $R$-regular graph (including the example shown in Figs. 1 and 2). In this paper, we focus on two special cases, namely:
- Cyclic graphs: In this case, the spread of the graph is $\delta = K-1$ as we can cover all the messages in the storage system by visiting exactly $K-1$ databases. Furthermore, $R = 2$, as every node in the graph is connected to adjacent nodes only. Applying the bound in Theorem 1, $R_{\text{PIR}} \leq \frac{1}{1 + \frac{K-1}{2}} = \frac{2}{K+1}$.
- Fully-connected graphs: In this case, the spread of the graph is $\delta = K-1$ as $W_1$ is connected to all other messages. Applying the bound in Theorem 1, $R_{\text{PIR}} \leq \frac{1}{1 + \frac{K-1}{K-1}} = \frac{1}{2}$. Also, in this case, $R = K-1$ and $N = \binom{K}{2}$, hence, $R_{\text{PIR}} \leq \frac{R}{N} = \frac{2}{K}$.
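The two specializations in Remark 2 can be checked numerically. The sketch below assumes that the Theorem 1 bound has the form $\min\{R/N, 1/(1+\delta/R)\}$, which is the form consistent with the two capacity values stated in this paper, and that the spread is $K-1$ in both cases. The function names are illustrative.

```python
from fractions import Fraction
from math import comb

def upper_bound(R, N, spread):
    """Assumed form of the Theorem 1 bound: min{R/N, 1/(1 + spread/R)}."""
    return min(Fraction(R, N), 1 / (1 + Fraction(spread, R)))

def cyclic_bound(K):
    # Cyclic graph: R = 2, N = K, spread = K - 1.
    return upper_bound(R=2, N=K, spread=K - 1)

def fully_connected_bound(K):
    # Fully-connected graph: R = K - 1, N = C(K, 2), spread = K - 1.
    return upper_bound(R=K - 1, N=comb(K, 2), spread=K - 1)

for K in range(3, 8):
    assert cyclic_bound(K) == Fraction(2, K + 1)
    assert fully_connected_bound(K) == min(Fraction(2, K), Fraction(1, 2))
    print(K, cyclic_bound(K), fully_connected_bound(K))
```

Exact rational arithmetic is used so that the comparison with $\frac{2}{K+1}$ and $\min\{\frac{2}{K}, \frac{1}{2}\}$ is not affected by floating-point rounding.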
In the following two results, we characterize the PIR capacity of cyclic graphs and fully-connected graphs. The converse proofs for Theorems 2 and 3 are corollaries of Theorem 1 as shown in Remark 2. The achievability proofs of Theorems 2 and 3 are given in Section 5.
Theorem 2 (Capacity of cyclic graphs)
For a cyclic graph storage system, the PIR capacity is given by
$$C_{\text{PIR}} = \frac{2}{K+1} = \frac{2}{N+1}.$$
Theorem 3 (Capacity of fully-connected graphs)
For a fully-connected graph storage system with $K$ messages, the PIR capacity is given by
$$C_{\text{PIR}} = \min\left\{\frac{2}{K}, \frac{1}{2}\right\}.$$
Remark 3
The capacity results in this work reveal a severe loss in the retrieval rate due to non-replication. For the cyclic and fully-connected graphs, $C_{\text{PIR}} \to 0$ as $K \to \infty$. This is in contrast to the classical PIR problem [9], where $C \to 1 - \frac{1}{N}$ as $K \to \infty$. This is intuitively due to the fact that as $K \to \infty$, we have $N \to \infty$. Meanwhile, the number of side information equations generated is limited due to non-replication. In particular, the side information equations are related to $R$ (in contrast to $N$ in the classical model), while total downloads grow with $N$ as the user needs to download from all databases to satisfy the privacy constraint. The ratio $\frac{R}{N} \to 0$ as $K \to \infty$ for both cases.
Remark 4
The results of this work outperform the trivial scheme of downloading all messages, which achieves $R_{\text{PIR}} = \frac{1}{K}$. Our retrieval rate also outperforms the best achievable scheme in [55], which achieves $R_{\text{PIR}} = \frac{1}{N}$ for cyclic graphs (which is the comparable case to our work), since our achievable rate satisfies $\frac{2}{N+1} > \frac{1}{N}$ for the case of cyclic graphs. This implies that the retrieval rates in non-replicated PIR systems in [55] may be improved. Nevertheless, the results in [55] are more general, as they are valid for all graph-based storage systems. The results in [55] also cover collusion resistance, which is outside the scope of our work here.
4 Converse Proof
In this section, we prove Theorem 1. To that end, we present a general upper bound for the retrieval rate for general $R$-regular graphs for the case of $M = 2$.
Let denote the collection of all queries to all databases for all desired messages, i.e.,
[TABLE]
We assume that the retrieval scheme is symmetric across databases (as in [8, Lemma 1]). This assumption is without loss of generality, since any asymmetric retrieval scheme can be transformed into a symmetric one by means of time-sharing without changing the retrieval rate. Hence, for , we have
[TABLE]
where . We need the following lemma.
Lemma 1
Let denote the set of databases containing message , then
[TABLE]
Proof: We have
[TABLE]
where (19) follows from the fact that is independent of the messages and the queries , (20) follows from the reliability constraint. For (21), we note that the answer strings are deterministic functions of only, hence is a Markov chain and can be dropped from the conditioning. (23) follows from the fact that answer strings are deterministic functions of the messages and queries, and (24) follows from the database symmetry in (16). Rearranging (24) concludes the proof.
We are now ready to prove the converse statement in Theorem 1. We first prove that $R_{\text{PIR}} \leq \frac{R}{N}$. From Lemma 1, we have
[TABLE]
where (27) follows from the fact that conditioning reduces entropy, (28) follows from the symmetry across databases. Therefore,
[TABLE]
Next, we prove that , where is the spread of the graph; see Definition 2. In order to obtain the spread of the graph, we begin by the node representing , then we enumerate all the edges (databases) connecting to . Without loss of generality, label these databases by . These edges are connecting to the nodes corresponding to . Then, we reduce the graph by removing all the connecting nodes to except one (which belongs to the path of the largest distance), which we denote by . We again enumerate all the edges connecting to with the nodes , where is the number of databases that contain after the graph reduction, then we reduce the graph again by removing all nodes connecting to except one, which we denote by , and so on. Then, we have
[TABLE]
where (31) follows from the reliability constraint and the independence of queries and messages, (34) follows from the independence bound, (35) follows from the non-negativity of the entropy function where denotes the answer strings returned by the sequence of the databases that define the spread of the graph, and (37) follows from the fact that conditioning on cannot increase entropy.
To show (38), we note that from the reduction procedure that results in , we have
[TABLE]
for some deterministic function , and is the number of reductions on the graph until all nodes are removed from the graph. Since the leading message at the th graph reduction belongs to the set of the connected messages in the th graph reduction, and at the th graph reduction, the nodes connecting to are removed from the graph, we have , where is the index of the leading message in the th database. Consequently, we can drop as they are deterministic functions of .
Now, we have
[TABLE]
where (46) follows from the privacy constraint, and (47) follows from Lemma 1. Reordering terms, we have
[TABLE]
which together with (29) concludes the proof of Theorem 1.
5 Achievability Proof
In this section, we begin with a motivating example of $K = 3$ to show the basic ingredients of the achievable scheme. In fact, the graph for this motivating example is both cyclic and fully-connected (see Fig. 3); therefore, this motivating example can be considered as a unifying instance of the optimal scheme for both cyclic and fully-connected graphs. Then, we present general capacity-achieving schemes for cyclic graphs and fully-connected graphs. Finally, we show by an example how we can extend the presented schemes to the case of $M = 3$ (with no claim of optimality).
5.1 Motivating Example: $K = 3$, $R = 2$, $M = 2$, $N = 3$
In this example, we consider a storage system that consists of $N = 3$ databases. The system stores $K = 3$ messages in total, namely $W_1, W_2, W_3$. Each message is replicated across $R = 2$ databases, such that each database stores $M = 2$ messages (see Table 2). This is a cyclic and also a fully-connected graph as shown in Fig. 3.
Without loss of generality, assume that the desired message is $W_1$. To construct the capacity-achieving scheme, the user randomly permutes the indices of the symbols of each message independently, uniformly, and privately from the databases. Denote the permuted version of $W_1$ by the vector $a$, the permuted version of $W_2$ by $b$, and the permuted version of $W_3$ by $c$. Pick $L = 12$.
A straightforward solution for this problem is to apply the Sun and Jafar scheme in [9]. Since every database contains $M = 2$ messages, the user downloads a single bit from each stored message from each database in round 1, i.e., two individual bits from each of the three databases. Now, the user exploits the two undesired bits downloaded from database 3 as side information by downloading, from each of database 1 and database 2, the sum of a new desired bit and one of these side-information bits. Finally, the user downloads a sum of two undesired bits from database 3. The query table for this scheme is shown in Table 3. Note that although this last sum is irrelevant to the decodability of $W_1$, the user needs to download it to satisfy the privacy constraint. Otherwise, database 3 would figure out that the desired message is $W_1$, as the user requests 2 bits from database 3 when the desired message is $W_1$, while the user would have requested 3 bits from database 3 if the desired message was $W_2$ or $W_3$. With this scheme, the user downloads 4 bits from $W_1$ out of the 9 total downloads, hence $R_{\text{PIR}} = \frac{4}{9}$.
Although this scheme outperforms the scheme in [55] in terms of the retrieval rate (the scheme in [55] achieves $\frac{1}{3}$), there is room for improving it. The main source of inefficiency of the scheme is the downloads from database 3, as the user downloads 3 bits and exploits only 2 of them. Moreover, the sum that the user downloads from database 3 is a new, independent bit. If the user introduces dependency to the downloads of database 3, the user may compress the requests from database 3, and improve the retrieval rate. In order to do this, the user downloads only two 2-sums from database 3, each combining an undesired bit that is also downloaded individually from database 1 or database 2 with one of the side-information bits (see Table 4). For the decodability, the user can decode each side-information bit by canceling the known undesired bit from the corresponding sum. Therefore, the two desired bits in the 2-sums of databases 1 and 2 are decodable by canceling these side-information bits.
(Footnote 3: Throughout this work, we use the expressions "dependency" and "compression". In previous PIR works, the user downloads new and independent undesired symbols at each round, which can be used in later rounds as side information. However, in this work, the user downloads undesired symbols which are dependent on the undesired symbols downloaded from other databases. We download these dependent symbols even from the databases that do not contain the desired message. We call these "dependent" downloads to differentiate them from "side information" downloads, which are intended to be used to decode the desired message directly. Furthermore, by "compression", we mean downloading shorter (fewer) answer strings than the greedy algorithm in [9] by exploiting the knowledge of the dependent symbols.)
Nevertheless, the scheme in Table 4 is not private because the user downloads only 2 bits from database 3, each in the form of a sum of 2 bits, while downloading 3 bits from each of the other databases. To remedy this problem, the user should repeat the compression of the downloads over all databases, i.e., the user should download 2 bits from each of the other two databases in the same manner as from database 3. Hence, in repetition 2, the user compresses the downloads from database 2 and downloads two 2-sums from it. Similarly, in repetition 3, the user downloads two 2-sums from database 1. The complete query structure is given in Table 5.
Next, we discuss privacy, decodability and the rate of this achievable scheme.
Regarding privacy: The query structure is now symmetric across the databases, and the indices of the bits from each message are chosen uniformly, independently and privately. Hence, all queries are equally likely, and the scheme is private.
Regarding decodability: We note that each repetition is decodable separately. As we discussed above, are decodable in repetition 1. For repetition 2, is decodable directly, is decodable by canceling from , and is decodable by canceling from . Finally, is decodable by canceling from and therefore is decodable by further canceling from (or equivalently by adding and under modulo-2 addition and canceling from the sum). The decodability of repetition 3 follows in a similar way to the decodability of repetition 2 by exchanging the roles of .
Regarding the achievable rate: The user downloads 12 bits from $W_1$ out of a total of 24 downloads. Consequently, $R_{\text{PIR}} = \frac{12}{24} = \frac{1}{2}$, which matches the upper bound in Theorem 1.
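The decoding chain of a single repetition can be simulated over bits. The sketch below is a hypothetical reconstruction, not a transcription of Table 5: it assumes that databases 1 and 2 store $W_1$ (with $W_2$ and $W_3$ respectively), that database 3 stores $W_2$ and $W_3$, and the particular bit indices are chosen only for illustration.

```python
import random

def repetition_answers(a, b, c):
    """Answers of one repetition; a is the desired message, b and c are undesired.
    The storage assignment and bit indices are illustrative assumptions."""
    db1 = [a[0], b[0], a[2] ^ b[1]]   # full query: two single bits and one 2-sum
    db2 = [a[1], c[0], a[3] ^ c[1]]   # full query
    db3 = [b[0] ^ c[1], c[0] ^ b[1]]  # compressed query: two dependent 2-sums
    return db1, db2, db3

def decode(db1, db2, db3):
    """Recover the four desired bits by successive cancellation."""
    a1, b1, sum1 = db1
    a2, c1, sum2 = db2
    c2 = db3[0] ^ b1   # cancel the known b1 to obtain side information c2
    b2 = db3[1] ^ c1   # cancel the known c1 to obtain side information b2
    a3 = sum1 ^ b2     # cancel side information at database 1
    a4 = sum2 ^ c2     # cancel side information at database 2
    return [a1, a2, a3, a4]

rng = random.Random(0)
a = [rng.randint(0, 1) for _ in range(4)]
b = [rng.randint(0, 1) for _ in range(2)]
c = [rng.randint(0, 1) for _ in range(2)]

answers = repetition_answers(a, b, c)
assert decode(*answers) == a
downloads = sum(len(ans) for ans in answers)
print(len(a), "desired bits from", downloads, "downloads")
```

Under these assumptions one repetition yields 4 desired bits from 8 downloads, so three repetitions give 12 out of 24, in line with the rate of $\frac{1}{2}$ stated above.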
Remark 5
It is interesting to compare the PIR capacity here to the PIR capacity in [40], where the contents are stored in the databases using the optimal storage strategy under the memory-size constraint $\mu$. Note that, in this example, $\mu = \frac{2}{3}$ as every database stores 2 full messages out of 3 messages. Using the optimal storage strategy in [40], the PIR capacity is $\frac{4}{7}$, which is larger than the PIR capacity here, $\frac{1}{2}$. This implies a loss in the PIR capacity due to storing full messages here as opposed to storing uncoded parts of the messages in [40] subject to the same memory-size constraint.
5.2 General Achievability for the Case of Cyclic Graphs
In this section, we generalize the ideas of the motivating example to arbitrary $K$. The new ingredient in this scheme (in contrast to [9]) is the compression of the queries submitted to a subset of the databases. To satisfy the privacy constraint, the user performs the scheme over multiple repetitions. In each repetition, the user chooses to submit the full query (according to [9]) to 2 databases. For the remaining $N-2$ databases, the user downloads two symbols in the form of 2-sums. The scheme works over blocks of 4 symbols per repetition. The general scheme for cyclic graphs can be summarized as:
1. Index preparation: The indices of the symbols of each message are permuted independently, uniformly, and privately at the user side.
2. Constructing full queries: We apply the scheme of Sun and Jafar [9] to construct the full queries to all databases. We apply this scheme over blocks of 4 symbols. To that end, the user downloads 1 individual symbol from each message from each database in round 1. Next, the user downloads a 2-sum from the stored messages in each database. This sum exploits the side information generated from other databases. Note that in this graph, the user can generate 1 side information equation for each database. Another change from [9] is that even for the databases that do not contain the desired message, the user exploits the side information generated at other databases by introducing dependency to the answers.
3. Compressing queries: The user chooses $N-2$ different databases at each repetition. The user compresses the queries to these databases by adding the individual symbols in round 1 into a single equation. (Footnote 4: In some cases, we may need to shuffle the indices of the symbols in the sums to prevent ending up with useless equations. For example, after compressing, one 2-sum may consist of two symbols that are both decodable from the remaining databases, which makes that sum useless, while the other 2-sum consists of two unknown symbols and is not decodable. If we shuffle the indices so that each 2-sum pairs a decodable symbol with an unknown one, then the user can use both equations. This does not affect the privacy as the indices are permuted uniformly and privately at the user side.)
4. Repetition: Repeat steps 2 and 3 over new blocks of 4 symbols, once for every choice of the databases that receive the full queries.
5.2.1 Decodability, Privacy, and Achievable Rate
Regarding decodability: In this scheme, at each repetition, we have 4 unknowns corresponding to the desired message and $2(K-1)$ unknowns corresponding to the undesired messages. The user downloads 3 equations (full queries) from 2 databases, and 2 equations from the remaining $N-2$ databases. Hence, the user downloads in total $2N+2$ equations in $2K+2 = 2N+2$ unknowns. This linear system is decodable (up to necessary index shuffling).
Regarding privacy: The scheme is private since the symbols are permuted randomly and privately at the user side and the scheme is repeated along all combinations of the databases. Hence, the structure of the queries is the same across all databases. Thus, the distribution of the queries is the same irrespective to the desired message.
Regarding the achievable rate: From every repetition of the scheme, the user can decode 4 symbols from the desired message, thus,
$$R_{\text{PIR}} = \frac{4}{3 \cdot 2 + 2(N-2)} = \frac{4}{2N+2} = \frac{2}{N+1} = \frac{2}{K+1}.$$
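The rate of the cyclic scheme follows from counting downloads per repetition. A minimal sketch of that count, using only the figures quoted above (3 equations from each of the 2 databases that receive full queries, 2 equations from each remaining database, 4 desired symbols decoded), is given below; the function name is illustrative.

```python
from fractions import Fraction

def cyclic_scheme_rate(N):
    """Rate per repetition of the cyclic scheme for N = K databases."""
    full = 2 * 3               # 3 equations from each of the 2 full-query databases
    compressed = (N - 2) * 2   # 2 equations (2-sums) from each remaining database
    desired = 4                # desired symbols decoded per repetition
    return Fraction(desired, full + compressed)

for N in range(3, 10):
    rate = cyclic_scheme_rate(N)
    assert rate == Fraction(2, N + 1)
    # The trivial scheme of downloading everything achieves only 1/K = 1/N.
    assert rate > Fraction(1, N)
    print(N, rate)
```

Because the ratio is the same in every repetition, the number of repetitions does not affect the rate.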
5.3 General Achievability for the Case of Fully-Connected Graphs
In this section, we present the general achievability for the case of fully-connected graphs. For , we have 1 database containing 2 messages; the capacity-achieving scheme is simply to download the contents of the entire database, hence . For , the capacity-achieving scheme is exactly the motivating example in Section 5.1, hence .
For $K \geq 4$, the upper bound $\frac{2}{K}$ is the active upper bound. The general achievability for this case is given below. The achievable scheme works with symbols from a finite field $\mathbb{F}_q$, where $q$ is sufficiently large and is prime.
1. Index preparation: The indices of the symbols of each message are permuted independently, uniformly, and privately at the user side.
2. Retrieval from database 1: Denote the permuted contents of the th database by . Without loss of generality, assume that the desired message is stored in database 1, hence, is the permuted version of the desired message. From database 1, the user downloads a weighted sum of two symbols from the two messages, i.e., the user downloads from database 1, where , . The choice of will be specified later.
3. Exploiting side information: The user downloads different weighted sums from every database. If the th database contains the desired message, the user downloads a new desired symbol in the sum. If the message stored in the th database is undesired, the user exploits the same message symbol in all databases. That is, the user downloads the weighted sum , where indices are chosen depending on the message (if desired, we increment the index; if undesired, we fix the index to 1).
4. Database symmetry: The user repeats the last step across all databases.
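The query construction in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `build_queries`, the representation of a database as an edge `(a, b)` of the complete graph on messages `1..K`, and the use of independently drawn nonzero coefficients are all choices made for this sketch. The symbol indices refer to the privately permuted messages, so the index values themselves carry no information about the original symbol positions.

```python
import random
from itertools import combinations


def build_queries(K, desired, q, rng):
    """Return one query per database of a fully-connected storage graph.

    A database is an edge (a, b), a < b, storing messages a and b.
    A query is a pair of terms (message, symbol_index, coefficient),
    meaning: download coeff_a * W_a[idx_a] + coeff_b * W_b[idx_b] over F_q.
    """
    next_desired_index = 1
    queries = {}
    for a, b in combinations(range(1, K + 1), 2):
        terms = []
        for msg in (a, b):
            if msg == desired:
                # A new desired symbol in every database that stores it.
                idx = next_desired_index
                next_desired_index += 1
            else:
                # The same undesired symbol (index 1) is reused everywhere.
                idx = 1
            # Assumption of this sketch: any nonzero field element.
            coeff = rng.randrange(1, q)
            terms.append((msg, idx, coeff))
        queries[(a, b)] = tuple(terms)
    return queries


queries = build_queries(K=4, desired=1, q=7, rng=random.Random(0))
for database, terms in queries.items():
    print(database, terms)
```

Every database receives a query with exactly one symbol from each of its two stored messages, regardless of which message is desired; only the (privately permuted) indices differ.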
5.3.1 Decodability, Privacy, and Achievable Rate
Regarding decodability: The user collects $\binom{K}{2}$ equations, one from each database. These equations have $K-1$ unknowns corresponding to the desired message and $K-1$ unknowns corresponding to the undesired messages, i.e., we have a linear system of $\binom{K}{2}$ equations in $2(K-1)$ unknowns. Without loss of generality, assume that the desired message is stored in the first $K-1$ databases, hence . The linear system of equations can be written as:
[TABLE]
The choice of the coefficients , is such that the decoding matrix is invertible. One simple way (Footnote 5: In general, one can enumerate all possible that are full rank for every desired message . The user can choose uniformly from this set.) to choose the coefficients is to choose them uniformly from all possible permutations of the field elements.

Regarding privacy: Since the message symbols and the coefficients are permuted uniformly, and the distribution of the queries for every database is the same irrespective of the desired message, the retrieval scheme is private.

Regarding the achievable rate: The user downloads $\binom{K}{2}$ answer strings, of which $K-1$ are desired symbols and are decodable, thus,

$$R_{\text{PIR}}=\frac{K-1}{\binom{K}{2}}=\frac{2}{K}$$
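The decodability and rate claims can be checked numerically on small instances. The sketch below builds a coefficient matrix with one row per database and one column per unknown (the $K-1$ desired symbols, then symbol 1 of each undesired message), and computes its rank over a prime field by Gaussian elimination. The helper names, the prime $p=101$, and the strategy of redrawing random coefficients until the matrix has full column rank are assumptions of this sketch; the text above describes other ways of choosing the coefficients.

```python
import random
from fractions import Fraction
from itertools import combinations
from math import comb


def rank_mod_p(rows, p):
    """Rank of an integer matrix over the prime field F_p (Gauss-Jordan)."""
    rows = [r[:] for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse (Python 3.8+)
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                f = rows[i][col]
                rows[i] = [(x - f * y) % p for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def decoding_matrix(K, desired, p, rng):
    """One row per database (edge); columns are the 2(K-1) unknowns."""
    undesired = [m for m in range(1, K + 1) if m != desired]
    col_of = {m: K - 1 + i for i, m in enumerate(undesired)}
    rows, next_idx = [], 0
    for a, b in combinations(range(1, K + 1), 2):
        row = [0] * (2 * (K - 1))
        for m in (a, b):
            if m == desired:
                row[next_idx] = rng.randrange(1, p)  # new desired symbol
                next_idx += 1
            else:
                row[col_of[m]] = rng.randrange(1, p)  # reused undesired symbol
        rows.append(row)
    return rows


def find_decodable_matrix(K, desired, p, seed=0, max_tries=1000):
    """Redraw coefficients until all 2(K-1) unknowns are decodable."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        rows = decoding_matrix(K, desired, p, rng)
        if rank_mod_p(rows, p) == 2 * (K - 1):
            return rows
    raise RuntimeError("no full-rank coefficient choice found")


K, p = 4, 101
rows = find_decodable_matrix(K, desired=1, p=p)
print(len(rows), "equations, rank", rank_mod_p(rows, p))
print("rate:", Fraction(K - 1, comb(K, 2)))
```

The rate line simply evaluates $\frac{K-1}{\binom{K}{2}}$ exactly, which agrees with $\frac{2}{K}$ because $\binom{K}{2}=\frac{K(K-1)}{2}$.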
5.3.2 Further Example: Fully-Connected Graph with
As a concrete example, we present the achievable scheme for for a fully-connected graph. Hence, we have databases and . Thus, this is a system. We assume that , , , , , and . See Fig. 4 and Table 6 for the graph structure and the database contents. Assume for the sake of simplicity that . The scheme works with bits. Denote the permuted message by , by , and so on. Assume without loss of generality that the user is interested in retrieving . Therefore, the user downloads the following:
[TABLE]
This system of equations is full-rank, hence is decodable along with . The rate of retrieval is . To see privacy, let the desired message be , in which case, the user downloads:
[TABLE]
This system of equations is full-rank as well. Since the queries have the same structure and the symbol indices are chosen randomly and privately, the scheme is private.
5.4 Discussion and Further Extensions: Extension to :
In this section, we show how the ideas of can be extended to . We discuss our additional ideas via the following example. In this example, we consider a storage system, whose structure is shown in Table 7. Note that, in this example, our graph formulation fails to represent the storage structure since ; however, we can use the graph structure in [55] as in this example. In the following, we only show an achievable scheme for this example without any claim of optimality. In addition to introducing dependency as in the previous schemes, we have a new insight in this case, which is to exploit processed side information.
Our scheme works with symbols from . We permute the indices of the messages uniformly, independently and privately at the user side. We denote the permuted message symbols of by the vectors .
The idea of our scheme is to extend the greedy algorithm of [9] to our setting (see Table 8). To that end, the user starts by downloading individual symbols from every message from every database in round 1. Hence, the user downloads , , from database 1, , , from database 2, and so on.
Returning to the scheme in [9], the user downloads 2-sums in round 2 from all databases. The undesired symbols in round 1 are exploited as side information in round 2. In our case, this is applicable in databases 1 and 2. Therefore, the user downloads the sums , , , from database 1, and , , , . We complete round 2 by downloading , from database 1, and , from database 2 to satisfy the privacy constraint by downloading all combinations of the 2-sums.
In order to proceed with the achievable scheme, we need to generate side information in round 3, which consists of 3-sums. At this point, we note two issues. First, we did not exploit the side information generated from in round 1, as the messages do not appear together at any database. Second, the side information needed in round 3 does not appear directly in any other database, unlike [9]; i.e., in round 3, we need the side information to be of the form and , which are not available in any other database. This motivates the use of processed side information, i.e., combining side information generated at multiple databases into a single side information equation that is usable at another database. There are two types of processing in this example: combining double 2-sums and combining triple 2-sums.
First, for combining double 2-sums to get a single side information equation, we download from database 3 and from database 4. By adding the two 2-sums (modulo-2 addition), we get which can be used as side information in database 1. Similarly, we obtain the single side information by adding and and again use it in database 1. Next, we generate the side information needed in database 2. We combine from database 3 with from database 4 to get , and combine from database 3 and from database 4 to get .
Second, we can create extra side information by combining triple 2-sums. To see that, we can add from database 1, from database 3, and from database 4 to create the side information , which can be exploited in database 2. Similarly, we can add from database 2, from database 3, and from database 4 to get , which can be exploited in database 1.
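The mechanics of processed side information can be illustrated with modulo-2 addition of symbol sums: a symbol that appears in two of the combined sums cancels, leaving a new sum that was never downloaded directly. In the sketch below, a sum is represented by the set of symbol labels it contains, so modulo-2 addition becomes a symmetric difference. The labels `a1`, `b1`, `c1`, `d1` and the helper names are hypothetical placeholders and do not correspond to the specific symbols of Table 8.

```python
import random


def add_mod2(*sums):
    """Add sums of binary symbols modulo 2.

    Each sum is a set of symbol labels; labels that occur an even number
    of times across the inputs cancel out.
    """
    result = frozenset()
    for s in sums:
        result = result ^ frozenset(s)
    return result


def evaluate(sum_labels, bits):
    """Value of a modulo-2 sum given concrete bit values for the symbols."""
    value = 0
    for label in sum_labels:
        value ^= bits[label]
    return value


# Combining a double of 2-sums: (a1 + c1) + (c1 + b1) = a1 + b1.
double = add_mod2({"a1", "c1"}, {"c1", "b1"})

# Combining a triple of 2-sums: (a1 + b1) + (a1 + c1) + (b1 + d1) = c1 + d1.
triple = add_mod2({"a1", "b1"}, {"a1", "c1"}, {"b1", "d1"})

# Numerical sanity check with random bits for the hypothetical symbols.
rng = random.Random(1)
bits = {label: rng.randrange(2) for label in ("a1", "b1", "c1", "d1")}
lhs = evaluate({"a1", "c1"}, bits) ^ evaluate({"c1", "b1"}, bits)
print(sorted(double), sorted(triple), lhs == evaluate(double, bits))
```

The set representation is convenient here because it tracks only which symbols survive the combination, which is exactly what determines whether the result is usable as side information at another database.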
To introduce dependency in databases 3 and 4 as in the previous schemes, we can download , , , and , which result from round 1.
Using this scheme, the user gets desired symbols out of total downloads, resulting in . This outperforms the achievable scheme of [55], which achieves .
We note that this is the first instance of using processed side information in PIR. Further, the presented scheme achieves the bound in Lemma 1 with equality, i.e., , which may be promising. However, an open question remains: can we compress the downloads in the same manner as the achievable scheme for cyclic graphs by exploiting dependencies?
6 Conclusion
In this paper, we investigated the PIR problem from non-replicated and non-colluding databases. We studied the storage systems, where every database stores messages. This system is uniquely described by an -regular graph. We proved a general upper bound, which depends on the spread of the graph. We derived the capacity of two classes of graphs, namely: cyclic graphs and fully-connected graphs. For these two classes of graphs, we proposed novel achievable schemes, whose retrieval rate matches the developed upper bound. Our results showed that non-replication significantly hurts the retrieval rate.
References
[1] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan. Private information retrieval. Journal of the ACM, 45(6):965–981, November 1998.
[2] N. B. Shah, K. V. Rashmi, and K. Ramchandran. One extra bit of download ensures perfectly private information retrieval. In IEEE ISIT, June 2014.
[3] G. Fanti and K. Ramchandran. Efficient private information retrieval over unsynchronized databases. IEEE Journal of Selected Topics in Signal Processing, 9(7):1229–1239, October 2015.
[4] T. Chan, S. Ho, and H. Yamamoto. Private information retrieval for coded storage. In IEEE ISIT, June 2015.
[5] A. Fazeli, A. Vardy, and E. Yaakobi. Codes for distributed PIR with low storage overhead. In IEEE ISIT, June 2015.
[6] R. Tajeddine and S. El Rouayheb. Private information retrieval from MDS coded data in distributed storage systems. In IEEE ISIT, July 2016.
[7] H. Sun and S. A. Jafar. Blind interference alignment for private information retrieval. In IEEE ISIT, July 2016.
[8] H. Sun and S. A. Jafar. The capacity of private information retrieval. In IEEE Globecom, December 2016.
