The Capacity of Private Information Retrieval from Heterogeneous Uncoded Caching Databases
Karim Banawan, Batuhan Arasli, Yi-Peng Wei, Sennur Ulukus

TL;DR
This paper investigates the optimal private information retrieval strategy from multiple databases with different storage capacities, revealing that heterogeneity does not reduce the maximum achievable download efficiency.
Contribution
It characterizes the optimal PIR download cost for heterogeneous storage databases and shows it equals the homogeneous case, providing explicit placement for three databases.
Findings
Optimal PIR download cost matches the homogeneous case.
Heterogeneity in storage does not reduce PIR capacity.
Explicit content placement for three databases is provided.
Abstract
We consider private information retrieval (PIR) of a single file out of $K$ files from $N$ non-colluding databases with heterogeneous storage constraints $\mathbf{m} = (m_1, \dots, m_N)$. The aim of this work is to jointly design the content placement phase and the information retrieval phase in order to minimize the download cost in the PIR phase. We characterize the optimal PIR download cost as a linear program. By analyzing the structure of the optimal solution of this linear program, we show that, surprisingly, the optimal download cost in our heterogeneous case matches its homogeneous counterpart where all databases have the same average storage constraint $\mu = \frac{1}{N} \sum_{n=1}^{N} m_n$. Thus, we show that there is no loss in the PIR capacity due to heterogeneity of storage spaces of the databases. We provide the optimum content placement explicitly for $N = 3$.
Table 1: Explicit content assignment for three databases, by case (entries not recoverable).
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742
This work was supported by NSF Grants CNS 15-26608, CCF 17-13977 and ECCS 18-07348. A shorter version is submitted to IEEE ISIT 2019.
1 Introduction
The problem of private information retrieval (PIR), introduced in [1], has attracted much interest in the information theory community with leading efforts [2, 3, 4, 5, 6]. In the classical setting of PIR, a user wants to retrieve a file out of $K$ files from $N$ databases, each storing the same content of entire files, such that no individual database can learn the identity of the desired file. Sun and Jafar [7] characterized the optimal normalized download cost of the classical setting to be $D^* = 1 + \frac{1}{N} + \cdots + \frac{1}{N^{K-1}}$. Fundamental limits of many interesting variants of the PIR problem have been investigated in [8–53].
A common assumption in most of these works is that the databases have sufficiently large storage space that can accommodate all files in a replicated manner. This may not be the case for peer-to-peer (P2P) and device-to-device (D2D) networks, where information retrieval takes place directly between the users. Here, the user devices (databases) will have limited and heterogeneous sizes. This motivates the investigation of PIR from databases with heterogeneous storage constraints. In this work, we aim to jointly design the storage mechanism (content placement) and the information retrieval scheme such that the normalized PIR download cost is minimized in the retrieval phase.
Reference [36] studies PIR from homogeneous storage-limited databases. In [36], each database has the same limited storage space of $\mu K L$ bits with $\frac{1}{N} \leq \mu \leq 1$, where $L$ is the message size (note, perfect replication would have required $\mu = 1$). The goal of [36] is to find the optimal centralized uncoded caching scheme (content placement) that minimizes the PIR download cost. [36] shows that the symmetric batch caching scheme of [54] for content placement together with the Sun-Jafar scheme in [7] for information retrieval results in the lowest normalized download cost. [36] characterizes the optimal storage-download cost trade-off as the lower convex hull of the $N$ pairs $\left(\frac{t}{N},\, 1 + \frac{1}{t} + \cdots + \frac{1}{t^{K-1}}\right)$, $t = 1, 2, \dots, N$.
Meanwhile, the content assignment problem for heterogeneous databases (caches) is investigated in the context of coded caching in [55]. In the coded caching problem [54], the aim is to jointly design the placement and delivery phases in order to minimize the traffic load in the delivery phase during peak hours. Reference [55] proposes an optimization framework where placement and delivery schemes are optimized by solving a linear program. Using this optimization framework, [55] investigates the effects of heterogeneity in cache sizes on the delivery load memory trade-off with uncoded placement.
In this paper, we investigate PIR from databases with heterogeneous storage sizes (see Fig. 1). The $n$th database can accommodate $m_n K L$ bits, i.e., the storage system is constrained by the storage size vector $\mathbf{m} = (m_1, \dots, m_N)$. We aim to characterize the optimal normalized PIR download cost of this problem, and the corresponding optimal placement and optimal retrieval schemes. We focus on uncoded placement as in [36] and [55].
Motivated by [55], we first show that the optimal normalized download cost is characterized by a linear program. For the achievability, each message is partitioned into $2^N - 1$ partitions, one for each non-empty element of the power set of $[N]$, denoted $\mathcal{P}([N])$. For every partition, we apply the Sun-Jafar scheme [7]. The linear program arises as a consequence of optimizing the achievable download cost with respect to the partition sizes subject to the storage constraints. For the converse, we slightly modify the converse in [36] to be valid for the heterogeneous case. These achievability and converse proofs result in exactly the same linear program, yielding the exact capacity for this PIR problem for all $K$, $N$, $\mathbf{m}$. Interestingly, this is unlike the caching problem in [55] with no privacy requirements, where the linear program is only an achievability, and is shown to be the exact capacity only in special cases.
By studying the properties of the solution of the linear program, we show that, surprisingly, the optimal normalized download cost for the heterogeneous problem is identical to the optimal normalized download cost for the corresponding homogeneous problem, where the homogeneous storage constraint is $\mu = \frac{1}{N} \sum_{n=1}^{N} m_n$ for all databases. This implies that there is no loss in the PIR capacity due to heterogeneity of storage spaces of the databases. In fact, the PIR capacity depends only on the sum of the storage spaces and does not depend on how the storage spaces are distributed among the databases. The general proof for this intriguing result is a consequence of an existence proof for a positive linear combination using the theory of positive linear dependence in [56] (and using Farkas' lemma [57] as a special case) for the constraint set of the linear program. As a byproduct of the structural results, we show that, for the optimal content assignment, at most two consecutive types of message partitioning exist, i.e., each message should be partitioned such that there are partitions replicated over $j$ databases and at most one more type of partition replicated over $j+1$ databases, for some $j$ determined by the sum of the storage spaces. While for general $N$ we show the existence of an optimal content placement that attains the homogeneous PIR capacity, for $N = 3$, we provide an explicit (parametric in $\mathbf{m}$) optimal content placement.
2 System Model
We consider PIR from $N$ databases with heterogeneous sizes; see Fig. 1. We consider a storage system with $K$ i.i.d. messages (files). The $k$th message $W_k$ is of length $L$ bits, i.e.,
$$H(W_k) = L, \quad k \in [K], \qquad H(W_1, \dots, W_K) = KL.$$
The storage system consists of $N$ non-colluding databases. The storage size of the $n$th database is limited to $m_n K L$ bits, for some $0 \leq m_n \leq 1$. Specifically, we denote the contents of the $n$th database by $Z_n$, such that,
$$H(Z_n) \leq m_n K L, \quad n \in [N].$$
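The system operates in two phases: In the placement phase, the data center (content generator) stores the message set in the databases in such a way as to minimize the download cost in the retrieval phase subject to the heterogeneous storage constraints. The placement is done in a centralized fashion [54]. The user (retriever) has no access to the data center. Here, we focus on uncoded placement as in [36, 55], i.e., file $W_k$ can be partitioned as,
$$W_k = \bigcup_{\mathcal{S} \in \mathcal{P}([N])} W_{k,\mathcal{S}}, \tag{3}$$
where $W_{k,\mathcal{S}}$ is the set of $W_k$ bits that appear in the database set $\mathcal{S} \subseteq [N]$, where $\mathcal{P}(\cdot)$ is the power set. We write $|W_{k,\mathcal{S}}| = \alpha_{\mathcal{S}} L$, where $\alpha_{\mathcal{S}} \geq 0$. Under an uncoded placement, we have the following message size constraint,
$$\sum_{\mathcal{S} \in \mathcal{P}([N]),\, \mathcal{S} \neq \emptyset} \alpha_{\mathcal{S}} = 1, \tag{4}$$
where the sum runs over the non-empty subsets of $[N]$. In addition, we have the individual database storage constraints,
$$\sum_{\mathcal{S}:\, n \in \mathcal{S}} \alpha_{\mathcal{S}} \leq m_n, \quad n \in [N]. \tag{5}$$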
In the retrieval phase, the user is interested in retrieving $W_k$, $k \in [K]$, privately. The user submits a query $Q_n^{[k]}$ to the $n$th database. Since the user has no information about the files, the messages and queries are statistically independent, i.e.,
$$I\left(W_1, \dots, W_K;\, Q_1^{[k]}, \dots, Q_N^{[k]}\right) = 0.$$
The $n$th database responds with an answer string $A_n^{[k]}$, which is a function of the received query and the stored content, i.e.,
$$H\left(A_n^{[k]} \,\middle|\, Q_n^{[k]}, Z_n\right) = 0.$$
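To ensure privacy, the query submitted to the $n$th database when intended to retrieve $W_k$ should be statistically indistinguishable from the one when intended to retrieve $W_{k'}$, i.e.,
$$Q_n^{[k]} \sim Q_n^{[k']}, \quad n \in [N], \quad k, k' \in [K], \tag{8}$$
where $\sim$ denotes statistical equivalence.
The user needs to decode the desired message $W_k$ reliably from the received answer strings, consequently,
$$H\left(W_k \,\middle|\, Q_1^{[k]}, \dots, Q_N^{[k]}, A_1^{[k]}, \dots, A_N^{[k]}\right) \leq o(L), \tag{9}$$
where $\frac{o(L)}{L} \rightarrow 0$ as $L \rightarrow \infty$.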
An achievable PIR scheme satisfies constraints (8) and (9) for some file size $L$. The download cost $D$ is the size of the total downloaded bits from all databases,
$$D = \sum_{n=1}^{N} H\left(A_n^{[k]}\right).$$
For a given storage constraint vector $\mathbf{m}$, we aim to jointly design the placement phase (i.e., the contents $Z_n$, $n \in [N]$) and the retrieval scheme to minimize the normalized download cost $D/L$ in the retrieval phase.
3 Main Results
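Theorem 1 characterizes the optimal download cost under heterogeneous storage constraints in terms of a linear program. The main ingredients of the proof of Theorem 1 are introduced in Section 4 for $N = 3$, and the complete proof is given in Section 5 for general $N$.
Theorem 1
For PIR from databases with heterogeneous storage sizes $\mathbf{m} = (m_1, \dots, m_N)$, the optimal normalized download cost $D^*$ is the solution of the following linear program,
$$\begin{aligned} \min_{\alpha_{\mathcal{S}} \geq 0} \quad & \sum_{\ell=1}^{N} \sum_{\mathcal{S}:\, |\mathcal{S}| = \ell} \alpha_{\mathcal{S}} \left(1 + \frac{1}{\ell} + \cdots + \frac{1}{\ell^{K-1}}\right) \\ \text{s.t.} \quad & \sum_{\mathcal{S}} \alpha_{\mathcal{S}} = 1, \\ & \sum_{\mathcal{S}:\, n \in \mathcal{S}} \alpha_{\mathcal{S}} \leq m_n, \quad n \in [N], \end{aligned}$$
where $\mathcal{S}$ ranges over the non-empty subsets of $[N]$.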
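As an illustration of Theorem 1, the following minimal Python sketch evaluates the objective and checks the constraints of the linear program for a candidate placement. It assumes that each partition stored on a database set of size $\ell$ is weighted by the Sun-Jafar cost for $\ell$ replicated databases, as in the achievability argument. The function names and the dictionary representation of a placement are hypothetical choices made here, not part of the paper.

```python
from itertools import combinations


def replicated_cost(num_dbs, K):
    """Sun-Jafar normalized cost 1 + 1/n + ... + 1/n^(K-1) for n replicated databases."""
    return sum(1.0 / num_dbs**i for i in range(K))


def nonempty_subsets(N):
    """Yield every non-empty subset of {1, ..., N} as a frozenset."""
    for size in range(1, N + 1):
        for combo in combinations(range(1, N + 1), size):
            yield frozenset(combo)


def is_feasible(alpha, m, tol=1e-9):
    """Check non-negativity, the message size constraint and the storage constraints."""
    if any(a < -tol for a in alpha.values()):
        return False
    if abs(sum(alpha.values()) - 1.0) > tol:
        return False
    for n in range(1, len(m) + 1):
        load = sum(a for S, a in alpha.items() if n in S)
        if load > m[n - 1] + tol:
            return False
    return True


def lp_objective(alpha, K):
    """Normalized download cost of the partition-wise Sun-Jafar scheme."""
    return sum(a * replicated_cost(len(S), K) for S, a in alpha.items())


# Illustrative placement (an assumption): N = 3, K = 2, each database stores half
# of the library, split evenly over all subsets of size 1 and 2.
m = [0.5, 0.5, 0.5]
alpha = {S: 1.0 / 6.0 for S in nonempty_subsets(3) if len(S) <= 2}
print(is_feasible(alpha, m), lp_objective(alpha, K=2))
```

This only evaluates a given placement; finding the minimizer requires an LP solver or the structural results of Section 6.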
Theorem 2 shows the equivalence between the optimum download costs of the heterogeneous and homogeneous problems. The proof of Theorem 2 is given in Section 6.
Theorem 2
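The normalized download cost of the PIR problem with heterogeneous storage sizes $\mathbf{m}$ is equal to the normalized download cost of the PIR problem with homogeneous storage sizes $\mu = \frac{1}{N} \sum_{n=1}^{N} m_n$ for all databases, i.e., $D^*(\mathbf{m}) = D^*(\boldsymbol{\mu})$, where $\boldsymbol{\mu}$ is such that $\mu_n = \mu$, for $n \in [N]$.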
Remark 1
Theorem 2 implies that the storage size asymmetry does not hurt the PIR capacity, so long as the placement phase is optimized. This is unlike, for instance, access asymmetry in the case of replicated databases [37]. This is also unlike, as another instance, non-optimized content placement even for symmetric database sizes [53].
Remark 2
Theorem 2 in fact implies more than the stated equivalence between the heterogeneous and homogeneous storage cases: the optimal download cost in (1) depends only on the sum storage space $m_s = \sum_{n=1}^{N} m_n$. Thus, any distribution of storage space within the given sum storage space yields the same PIR capacity. In particular, a uniform distribution (the corresponding homogeneous case) has the same PIR capacity. Hence, there is no loss in the PIR capacity due to heterogeneity of storage spaces of the databases.
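The dependence on the sum storage alone can be sketched numerically. The snippet below is a minimal illustration, assuming the structure established later in Lemmas 3 and 4 (the optimum time-shares between two consecutive replication levels determined by the sum storage) and assuming every individual storage size is at most 1. The function names are hypothetical.

```python
import math


def replicated_cost(num_dbs, K):
    """Sun-Jafar normalized cost 1 + 1/n + ... + 1/n^(K-1) for n replicated databases."""
    return sum(1.0 / num_dbs**i for i in range(K))


def optimal_cost_from_sum(m, K, tol=1e-12):
    """Optimal normalized download cost computed from the sum storage only.

    Assumes 0 <= m_n <= 1 for every database (an assumption of this sketch).
    """
    N = len(m)
    m_s = min(sum(m), N)
    if m_s < 1 - tol:
        raise ValueError("sum storage below 1: the messages cannot be stored")
    j = min(int(math.floor(m_s + tol)), N)
    if j == N:
        return replicated_cost(N, K)
    frac = max(0.0, m_s - j)  # weight on replication level j + 1
    return (1 - frac) * replicated_cost(j, K) + frac * replicated_cost(j + 1, K)


# Two storage vectors with the same sum (1.5) give the same cost.
skewed = optimal_cost_from_sum([0.9, 0.4, 0.2], K=2)
uniform = optimal_cost_from_sum([0.5, 0.5, 0.5], K=2)
print(skewed, uniform)
```

The heterogeneous vector above is chosen for illustration; Theorem 2 is what guarantees that a placement attaining this value exists for it.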
4 Representative Example: $N = 3$
We introduce the main ingredients of the achievability and converse proofs using the example of databases. Without loss of generality, we take in this section.
4.1 Converse Proof
We note that [36, Theorem 1] can be applied to any storage constrained PIR problem with arbitrary storage . Hence, specializing to the case of (and ) with i.i.d. messages and uncoded content leads to [36, eqn. (39)],
[TABLE]
Using the uncoded storage assumption in (3), we can further lower bound (12) as,
[TABLE]
Normalizing with , taking the limit , and using the definition lead to the following lower bound on the normalized download cost ,
[TABLE]
where (16) follows from the message size constraint (4).
We further lower bound (16) by minimizing the right hand side with respect to under storage constraints. Thus, the solution of the following linear program serves as a lower bound (converse) for the normalized download cost,
[TABLE]
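where the variables with $|\mathcal{S}| = 1$ are $\alpha_1, \alpha_2, \alpha_3$, which represent the content stored in databases 1, 2 and 3 exclusively; the variables with $|\mathcal{S}| = 2$ are $\alpha_{12}, \alpha_{13}, \alpha_{23}$, which represent the content stored in databases 1 and 2, 1 and 3, and 2 and 3, respectively; and the variable with $|\mathcal{S}| = 3$ is $\alpha_{123}$, which represents the content stored in all three databases simultaneously.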
Next, we show that the lower bound expressed as a linear program in (4.1) can be achieved.
4.2 Achievability Proof
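In the placement phase, let $|W_{k,\mathcal{S}}| = \alpha_{\mathcal{S}} L$ for all $k \in [K]$. Assign the partition $W_{k,\mathcal{S}}$ to the set $\mathcal{S}$ of the databases for all $\mathcal{S}$. To retrieve $W_k$ privately, $k \in [K]$, the user applies the Sun-Jafar scheme [7] over the partitions of the files.
The partitions $W_{k,1}$, $W_{k,2}$, $W_{k,3}$, $k \in [K]$, are placed in a single database each. Thus, we apply [7] with $N = 1$, and download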
[TABLE]
The partitions $W_{k,12}$, $W_{k,13}$, $W_{k,23}$, $k \in [K]$, are placed in two databases each. Thus, we apply [7] with $N = 2$, and download
[TABLE]
Finally, the partition $W_{k,123}$ is placed in all three databases. Thus, we apply [7] with $N = 3$, and download
[TABLE]
Concatenating the downloads, file $W_k$ is reliably decodable. Hence, by summing up the download costs in (18), (19) and (20), we have the following normalized download cost,
[TABLE]
which matches the lower bound in (4.1) and is subject to the same constraints. Hence, the solution to the linear program in (4.1) is achievable, and gives the exact PIR capacity.
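To make the bookkeeping concrete, here is a worked instance under assumptions chosen purely for illustration: $K = 2$ messages, storage sizes $(0.9, 0.4, 0.2)$, and a hypothetical placement that is not taken from the paper. With $K = 2$, the Sun-Jafar normalized costs for 1, 2 and 3 replicated databases are $2$, $\frac{3}{2}$ and $\frac{4}{3}$.

```latex
% Normalized cost of the concatenated scheme for N = 3, K = 2
\frac{D}{L} = 2\,(\alpha_1 + \alpha_2 + \alpha_3)
            + \frac{3}{2}\,(\alpha_{12} + \alpha_{13} + \alpha_{23})
            + \frac{4}{3}\,\alpha_{123}

% Hypothetical placement: \alpha_1 = 0.4,\ \alpha_3 = 0.1,\ \alpha_{12} = 0.4,\ \alpha_{13} = 0.1,
% all other \alpha's equal to 0.
% Message size:  0.4 + 0.1 + 0.4 + 0.1 = 1
% Database 1:    \alpha_1 + \alpha_{12} + \alpha_{13} = 0.4 + 0.4 + 0.1 = 0.9
% Database 2:    \alpha_2 + \alpha_{12} + \alpha_{23} = 0 + 0.4 + 0 = 0.4
% Database 3:    \alpha_3 + \alpha_{13} + \alpha_{23} = 0.1 + 0.1 + 0 = 0.2

\frac{D}{L} = 2\,(0.5) + \frac{3}{2}\,(0.5) + \frac{4}{3}\,(0) = 1.75
```

Every storage constraint is met with equality in this instance, and the sum storage is $1.5$.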
4.3 Explicit Storage Assignment
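In this section, we solve the linear program in (4.1) to find the optimal storage assignment explicitly for $N = 3$. To that end, we denote $\beta_\ell = \sum_{\mathcal{S}:\, |\mathcal{S}| = \ell} \alpha_{\mathcal{S}}$, i.e.,
$$\begin{align} \beta_1 &= \alpha_1 + \alpha_2 + \alpha_3, \tag{22} \\ \beta_2 &= \alpha_{12} + \alpha_{13} + \alpha_{23}, \tag{23} \\ \beta_3 &= \alpha_{123}. \tag{24} \end{align}$$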
We first construct a relaxed optimization problem by summing up the three individual storage constraints in (4.1) into a single constraint. The relaxed problem is,
[TABLE]
where we define the sum storage space $m_s = m_1 + m_2 + m_3$. Plugging $\beta_1 = 1 - \beta_2 - \beta_3$,
[TABLE]
Since (4.3) is a linear program, the solution lies at the boundary of the feasible set. We have three cases depending on the sum storage space $m_s$.
Regime 1: $m_s < 1$
When $m_s < 1$: In this case, the second constraint in (4.3) requires $\beta_2 + 2\beta_3 \leq m_s - 1 < 0$, while we must have $\beta_2, \beta_3 \geq 0$. Hence, there is no feasible solution for the relaxed problem and thus the original problem (4.1) is infeasible as well.
Regime 2: $1 \leq m_s \leq 2$
When $1 \leq m_s \leq 2$: In this case, the constraint $\beta_2 + \beta_3 \leq 1$ is not binding. Hence, the solution satisfies the second constraint with equality, $\beta_2 + 2\beta_3 = m_s - 1$, which is non-negative in this regime. Thus, (4.3) can be written in an unconstrained manner as,
[TABLE]
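The optimal solution for (27) is $\beta_3 = 0$ and therefore $\beta_2 = m_s - 1$. From the equality constraint $\beta_1 + \beta_2 + \beta_3 = 1$, we have $\beta_1 = 2 - m_s$. Next, we map the solution of the relaxed problem in (4.3) to a feasible solution in the original problem in (4.1). From (24), $\alpha_{123} = \beta_3 = 0$. Thus, at the boundary of the inequality set of (4.1), we have,
$$\begin{align} \alpha_1 + \alpha_{12} + \alpha_{13} &= m_1, \tag{28} \\ \alpha_2 + \alpha_{12} + \alpha_{23} &= m_2, \tag{29} \\ \alpha_3 + \alpha_{13} + \alpha_{23} &= m_3. \tag{30} \end{align}$$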
Depending on the sign of , where , we have different content assignments. The common structure of (28)-(30) is . We assign if and otherwise. This ensures that for all . Using these assignments, we have sub-cases depending on the sign of . We summarize explicit content assignment for these cases in Table 1, where we take without loss of generality, to reduce the number of cases to enumerate. With these solutions, the optimal normalized download cost in this regime is,
[TABLE]
where corresponds to the average storage size.
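Regime 3: $2 \leq m_s \leq 3$
When $2 \leq m_s \leq 3$: In this case, the solution of (4.3) is at the intersection of the constraints $\beta_2 + \beta_3 = 1$ and $\beta_2 + 2\beta_3 = m_s - 1$. Hence, we have $\beta_3 = m_s - 2$ and $\beta_2 = 3 - m_s$, which are both non-negative in this regime. From the equality constraint $\beta_1 + \beta_2 + \beta_3 = 1$, we have $\beta_1 = 0$. Next, we map the solution of the relaxed problem in (4.3) to a feasible solution in the original problem in (4.1). From (22), $\beta_1 = 0$ implies $\alpha_1 = \alpha_2 = \alpha_3 = 0$. From (24), $\beta_3 = m_s - 2$ implies $\alpha_{123} = m_s - 2$. At the boundary of the feasible set of (4.1), we have,
$$\begin{align} \alpha_1 + \alpha_{12} + \alpha_{13} + \alpha_{123} &= m_1, \tag{32} \\ \alpha_2 + \alpha_{12} + \alpha_{23} + \alpha_{123} &= m_2, \tag{33} \\ \alpha_3 + \alpha_{13} + \alpha_{23} + \alpha_{123} &= m_3. \tag{34} \end{align}$$
Plugging $\alpha_{123} = m_s - 2$ and $\alpha_n = 0$ for $n = 1, 2, 3$ leads to the following content assignment,
$$\alpha_{12} = 1 - m_3, \quad \alpha_{13} = 1 - m_2, \quad \alpha_{23} = 1 - m_1, \quad \alpha_{123} = m_s - 2. \tag{35}$$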
With these solutions, the optimal normalized download cost in this regime is,
[TABLE]
This solution is also shown in Table 1.
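The Regime 3 assignment can be checked with a short script. The sketch below is derived here by solving the three boundary equations with $\alpha_1 = \alpha_2 = \alpha_3 = 0$ and $\alpha_{123} = m_s - 2$; it is not copied from Table 1, and it assumes every storage size satisfies $0 \leq m_n \leq 1$. The function names are hypothetical.

```python
def regime3_assignment(m, tol=1e-9):
    """Content assignment for N = 3 when the sum storage is between 2 and 3.

    Keys are tuples of database indices; values are fractions of each message.
    """
    m1, m2, m3 = m
    if any(x < -tol or x > 1 + tol for x in m):
        raise ValueError("this sketch assumes 0 <= m_n <= 1")
    m_s = m1 + m2 + m3
    if m_s < 2 - tol:
        raise ValueError("sum storage below 2: not in Regime 3")
    return {
        (1, 2): 1 - m3,
        (1, 3): 1 - m2,
        (2, 3): 1 - m1,
        (1, 2, 3): m_s - 2,
    }


def database_loads(alpha, N=3):
    """Fraction of the library stored on each database under the assignment."""
    return [sum(a for S, a in alpha.items() if n in S) for n in range(1, N + 1)]


# Illustrative storage vector (an assumption): sum storage 2.4.
m = (1.0, 0.8, 0.6)
alpha = regime3_assignment(m)
print(alpha)
print(database_loads(alpha), sum(alpha.values()))
```

Each database is filled exactly to its storage size and the fractions sum to one, so the assignment is feasible in the original problem.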
5 Optimal Download Cost for the General Problem
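In this section, we give the proof of Theorem 1, i.e., show the achievability and the converse proofs for the PIR problem with heterogeneous databases, for general $N$, $K$, $\mathbf{m}$.
5.1 General Achievability Proof
In this section, we show the achievability for general $N$ databases and $K$ messages. Let $D_\ell$ denote the optimal normalized download cost for the PIR problem with $\ell$ replicated databases [7] storing the same $K$ messages, which is achieved using the Sun-Jafar scheme [7],
$$D_\ell = 1 + \frac{1}{\ell} + \cdots + \frac{1}{\ell^{K-1}}. \tag{37}$$
We partition the messages over all subsets of $[N]$, such that $|W_{k,\mathcal{S}}| = \alpha_{\mathcal{S}} L$ for all $k \in [K]$. Using this partitioning, the subsets $\mathcal{S}$ such that $|\mathcal{S}| = 1$ correspond to a PIR problem with 1 database and $K$ messages. Hence, by applying the trivial scheme of downloading all these partitions, we download $D_1 \sum_{\mathcal{S}:\, |\mathcal{S}| = 1} \alpha_{\mathcal{S}} L$ bits, with $D_1 = K$. For the subsets $\mathcal{S}$ such that $|\mathcal{S}| = 2$, we have a PIR problem with 2 databases and $K$ messages. Therefore, by applying the Sun-Jafar scheme [7], we download $D_2 \sum_{\mathcal{S}:\, |\mathcal{S}| = 2} \alpha_{\mathcal{S}} L$ bits, and so on. This results in a total normalized download cost of $\sum_{\ell=1}^{N} \sum_{\mathcal{S}:\, |\mathcal{S}| = \ell} \alpha_{\mathcal{S}} D_\ell$. The optimal content assignment is obtained by optimizing over $\alpha_{\mathcal{S}}$ subject to the message size constraint (4), and the individual storage constraints (5). Thus, the achievable normalized download cost can be written as the following linear program,
$$\begin{aligned} \min_{\alpha_{\mathcal{S}} \geq 0} \quad & \sum_{\ell=1}^{N} \sum_{\mathcal{S}:\, |\mathcal{S}| = \ell} \alpha_{\mathcal{S}} D_\ell \\ \text{s.t.} \quad & \sum_{\mathcal{S}} \alpha_{\mathcal{S}} = 1, \\ & \sum_{\mathcal{S}:\, n \in \mathcal{S}} \alpha_{\mathcal{S}} \leq m_n, \quad n \in [N], \end{aligned}$$
where $\mathcal{S}$ ranges over the non-empty subsets of $[N]$.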
5.2 General Converse Proof
In this section, we show the converse for general $N$ databases and $K$ messages. The result in [36, Theorem 1] gives a general lower bound for a PIR system with $N$ databases and $K$ messages and arbitrary storage contents as
[TABLE]
where is given by,
[TABLE]
For uncoded placement, we have,
[TABLE]
The simplifications in [36], which are intended to deal with the nested harmonic sum, can be applied to the heterogeneous storage as well. Thus, the following lower bound in [36, (77)] is a valid lower bound for the normalized download cost for the heterogeneous problem,
[TABLE]
where
[TABLE]
Substituting (43) in (42) leads to,
[TABLE]
where the last step follows from the message size constraint.
This settles Theorem 1 by having shown that both achievability and converse proofs result in the same linear program which is given in (1).
6 Equivalence to the Homogeneous Problem
We prove Theorem 2, which implies an equivalence between the solution of (1) with heterogeneous storage constraints $\mathbf{m}$ and the solution of (1) with homogeneous storage constraint $\mu = \frac{1}{N} \sum_{n=1}^{N} m_n$ for all databases. To that end, let $\beta_\ell = \sum_{\mathcal{S}:\, |\mathcal{S}| = \ell} \alpha_{\mathcal{S}}$ as before. By adding the individual storage size constraints in (1), we write the following relaxed problem,
$$\begin{aligned} \min_{\beta_\ell \geq 0} \quad & \sum_{\ell=1}^{N} \beta_\ell D_\ell \\ \text{s.t.} \quad & \sum_{\ell=1}^{N} \beta_\ell = 1, \\ & \sum_{\ell=1}^{N} \ell\, \beta_\ell \leq m_s, \end{aligned}$$
where $m_s = \sum_{n=1}^{N} m_n$, as before, is the sum storage space and $D_\ell$ is defined in (37). The solution of the relaxed problem is potentially lower than (1), since the optimal solution of (1) is feasible in (6). Note that the relaxed problem (6) depends only on the sum storage space $m_s$ and the number of databases $N$. Therefore, the corresponding relaxed problem is the same for all distributions of the storage space among databases under the same $m_s$, including the uniform distribution which results in the homogeneous problem. Thus, in order to show the equivalence of the heterogeneous and homogeneous problems, it suffices to prove that the optimal solution of (6) can be mapped back to a feasible solution of (1).
We write the Lagrangian function corresponding to (6) as,
[TABLE]
The optimality conditions are,
[TABLE]
We have the following structural insights about the relaxed problem. The first lemma states that, in the optimal solution, there are at most two non-zero $\beta_\ell$s.
Lemma 1
There does not exist a subset $\mathcal{N} \subseteq [N]$, such that $|\mathcal{N}| \geq 3$ and $\beta_n > 0$ for all $n \in \mathcal{N}$.
**Proof: ** Assume for sake of contradiction that there exists such that . Hence, for all . From the optimality conditions in (49), we have,
[TABLE]
This results in independent equations in 2 unknowns ( and ), which is an inconsistent linear system if . Thus, we have a contradiction, and can be at most 2.
The second lemma states that if two $\beta_\ell$s are positive, then they must be consecutive.
Lemma 2
If $\beta_{n_1} > 0$, and $\beta_{n_2} > 0$ with $n_1 < n_2$, then $n_2 = n_1 + 1$.
**Proof: ** Assume for sake of contradiction that , , such that , and that where . Then, from the optimality conditions, we have,
[TABLE]
Solving for leads to,
[TABLE]
Since is convex in , we have , which implies , which is impossible since Lagrange multiplier , and from Lemma 1, . Thus, we have a contradiction, and we cannot have a zero between two non-zero s.
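The convexity invoked in this proof can be checked numerically. The sketch below assumes the per-level cost is the Sun-Jafar expression $1 + \frac{1}{\ell} + \cdots + \frac{1}{\ell^{K-1}}$ for $\ell$ replicated databases, and tests that it is strictly decreasing with non-negative second differences over $\ell = 1, \dots, N$. The function names are hypothetical.

```python
def replicated_cost(num_dbs, K):
    """Sun-Jafar normalized cost 1 + 1/n + ... + 1/n^(K-1) for n replicated databases."""
    return sum(1.0 / num_dbs**i for i in range(K))


def cost_shape(N, K, tol=1e-12):
    """Return (strictly_decreasing, convex) for the cost over 1..N replication levels."""
    d = [replicated_cost(n, K) for n in range(1, N + 1)]
    decreasing = all(d[i] > d[i + 1] for i in range(N - 1))
    # Discrete convexity: every second difference is non-negative.
    convex = all(d[i] - 2 * d[i + 1] + d[i + 2] >= -tol for i in range(N - 2))
    return decreasing, convex


print(cost_shape(N=6, K=3))
```

Convexity is what makes skipping a replication level suboptimal: mixing two non-adjacent levels never beats using the level in between with the same storage.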
The third lemma states that having an integer $m_s$ leads to activating a single $\beta_\ell$ only.
Lemma 3
*$\beta_j = 1$ and $\beta_\ell = 0$ for all $\ell \neq j$ if and only if $m_s = j$, where $j \in [N]$.*
**Proof: ** From the optimality conditions, we have,
[TABLE]
Substituting from (55) into (56) leads to,
[TABLE]
Since , we can choose an . Then, (57) implies,
[TABLE]
Since is monotonically decreasing in , we have for some positive constant . Since , the inequality must be satisfied with equality. To have a feasible solution for the two equations and , we must have and .
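The fourth lemma gives the solution of the relaxed problem for non-integer $m_s$.
Lemma 4
For the relaxed problem (6), if $j < m_s < j+1$, then $\beta_j = j + 1 - m_s$ and $\beta_{j+1} = m_s - j$.
**Proof: ** From Lemma 1, at most two $\beta_\ell$s should be positive. From Lemma 3, exactly two $\beta_\ell$s should be positive, as $m_s$ is not an integer here. From Lemma 2, the positive $\beta_\ell$s should be consecutive, and because of continuity, we must have $\beta_j > 0$ and $\beta_{j+1} > 0$. Thus, on the boundary, we have,
$$\begin{aligned} \beta_j + \beta_{j+1} &= 1, \\ j\,\beta_j + (j+1)\,\beta_{j+1} &= m_s. \end{aligned}$$
Solving these equations simultaneously results in $\beta_j = j + 1 - m_s$ and $\beta_{j+1} = m_s - j$.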
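As a worked instance of Lemma 4, the elimination can be written out and evaluated for one value of the sum storage. The value $m_s = 2.4$ is an assumption chosen only for illustration.

```latex
% Two boundary equations for consecutive levels j and j+1
\beta_j + \beta_{j+1} = 1, \qquad j\,\beta_j + (j+1)\,\beta_{j+1} = m_s
% Substitute \beta_j = 1 - \beta_{j+1} into the second equation
j\,(1 - \beta_{j+1}) + (j+1)\,\beta_{j+1} = m_s
\;\Longrightarrow\; \beta_{j+1} = m_s - j, \qquad \beta_j = j + 1 - m_s
% Example: m_s = 2.4, so j = 2
\beta_2 = 3 - 2.4 = 0.6, \qquad \beta_3 = 2.4 - 2 = 0.4
% Check: 0.6 + 0.4 = 1 and 2(0.6) + 3(0.4) = 2.4
```

Both values are strictly between 0 and 1 exactly when $m_s$ is not an integer.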
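Thus, Lemmas 1-4 establish the structure of the relaxed problem: First, since $m_n \leq 1$ for all $n$, we have $m_s \leq N$. If $m_s < 1$, then there is no PIR possible. If $m_s$ is an integer between 1 and $N$, then only one $\beta_\ell$ is positive and it is equal to 1. For instance, if $m_s = j$, then $\beta_j = 1$. In this case, only one type of $\alpha_{\mathcal{S}}$, with subscripts of size $|\mathcal{S}| = j$, is positive. If $m_s$ is a non-integer between 1 and $N$, then two $\beta_\ell$s are positive. For instance, if $j < m_s < j+1$, then $\beta_j$ and $\beta_{j+1}$ are positive and equal to $j + 1 - m_s$ and $m_s - j$, respectively. In this case, two types of $\alpha_{\mathcal{S}}$s, with subscripts of size $j$ and $j+1$, are positive.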
Finally, to show the equivalence of the original linear program in (1) and the relaxed linear problem in (6), we need to show that a feasible (non-negative) solution of (1) exists for every optimal solution of (6). That is, the optimal $\beta_\ell$s found in solving (6) can be mapped to a set of feasible $\alpha_{\mathcal{S}}$s in (1). We note that we have shown this by finding an explicit solution for the case of $N = 3$ in Section 4.3. We give an alternative proof for the case of $N = 4$ using Farkas' lemma [57] in Appendix A. In the following lemma, we give the proof for general $N$ by using the theory of positive linear dependence in [56].
Lemma 5
There exists a feasible (non-negative) solution of (1) corresponding to the optimal solution of the relaxed problem in (6).
**Proof: ** Since the inequality in the constraint set of the relaxed problem (6) is satisfied with equality, the inequalities in the constraint set of the original problem (1) should be satisfied with equality as well. We know from Lemmas 1-4 that only two $\beta_\ell$s will be positive, therefore, their expressions in terms of the corresponding $\alpha_{\mathcal{S}}$s will give two more equations. Assuming that $j < m_s < j+1$, we have $\beta_j = j + 1 - m_s$ and $\beta_{j+1} = m_s - j$; $\beta_j$ is a sum of $\binom{N}{j}$ $\alpha_{\mathcal{S}}$s and $\beta_{j+1}$ is a sum of $\binom{N}{j+1}$ $\alpha_{\mathcal{S}}$s. Thus, we have $N + 2$ equations in $\binom{N}{j} + \binom{N}{j+1}$ variables; and, we need to show that a feasible solution to these linear equations exists.
We denote this linear system of equations as $A\mathbf{x} = \mathbf{b}$ where $\mathbf{x}$ is the vector of $\alpha_{\mathcal{S}}$s, i.e., content assignments, and $\mathbf{b}$ is the vector of $m_n$s and $\beta_\ell$s, i.e., storage constraints and relaxed problem coefficients, i.e.,
[TABLE]
where
[TABLE]
and
[TABLE]
Now, $A$, an $(N+2) \times \left(\binom{N}{j} + \binom{N}{j+1}\right)$ matrix of zeros and ones, has the following properties:
1. Every column of the matrix is unique.
2. The first $\binom{N}{j}$ columns have $j$ 1s and $N - j$ 0s in their first $N$ rows. The last two elements of these columns are all 1s and all 0s, respectively.
3. The remaining $\binom{N}{j+1}$ columns have $j+1$ 1s and $N - j - 1$ 0s in their first $N$ rows. The last two elements of these columns are all 0s and all 1s, respectively.
4. The first three properties imply that, in the first $N$ rows of the matrix, every permutation of $j$ 1s and $N - j$ 0s exists in the first $\binom{N}{j}$ columns; and every permutation of $j+1$ 1s and $N - j - 1$ 0s exists in the next $\binom{N}{j+1}$ columns.
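To clarify the setting with an example, consider $N = 4$ and $1 < m_s < 2$. In this case, we have $\beta_1 = 2 - m_s$ and $\beta_2 = m_s - 1$. Corresponding to $\beta_1$, we have 4 $\alpha_{\mathcal{S}}$s, which are $\alpha_1, \alpha_2, \alpha_3, \alpha_4$, which sum to $\beta_1$. Corresponding to $\beta_2$, we have 6 $\alpha_{\mathcal{S}}$s, which are $\alpha_{12}, \alpha_{13}, \alpha_{14}, \alpha_{23}, \alpha_{24}, \alpha_{34}$, which sum to $\beta_2$. Thus, we have the $\mathbf{x}$ vector:
$$\mathbf{x} = \begin{bmatrix} \alpha_1 & \alpha_2 & \alpha_3 & \alpha_4 & \alpha_{12} & \alpha_{13} & \alpha_{14} & \alpha_{23} & \alpha_{24} & \alpha_{34} \end{bmatrix}^T,$$
the $\mathbf{b}$ vector:
$$\mathbf{b} = \begin{bmatrix} m_1 & m_2 & m_3 & m_4 & \beta_1 & \beta_2 \end{bmatrix}^T,$$
and the $A$ matrix:
$$A = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}.$$
Note, in the first 4 rows of $A$, in the first 4 columns we have all possible vectors with only one 1, and in the remaining 6 columns we have all possible vectors with two 1s.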
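The structure of this matrix can be generated programmatically. The sketch below builds the coefficient matrix for a given number of databases and a given lower replication level, using an assumed column ordering (all smaller subsets first, each group in lexicographic order), and checks one hypothetical non-negative assignment for four databases with sum storage 1.5. The function names and the numeric example are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def build_system(N, j):
    """Return (subsets, A) for the equality system with replication levels j and j + 1.

    Rows 1..N are the storage equations; the last two rows sum the level-j and
    level-(j + 1) variables, respectively.
    """
    subsets = [S for size in (j, j + 1)
               for S in combinations(range(1, N + 1), size)]
    A = [[1 if n in S else 0 for S in subsets] for n in range(1, N + 1)]
    A.append([1 if len(S) == j else 0 for S in subsets])
    A.append([1 if len(S) == j + 1 else 0 for S in subsets])
    return subsets, A


def matvec(A, x):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


subsets, A = build_system(N=4, j=1)

# Hypothetical non-negative assignment for m = (0.6, 0.4, 0.3, 0.2), m_s = 1.5,
# ordered as alpha_1..alpha_4, alpha_12, alpha_13, alpha_14, alpha_23, alpha_24, alpha_34.
x = [0.1, 0.1, 0.2, 0.1, 0.3, 0.1, 0.1, 0.0, 0.0, 0.0]
b = [0.6, 0.4, 0.3, 0.2, 0.5, 0.5]  # storage sizes, then beta_1 and beta_2
print(matvec(A, x))
```

The product reproduces the right-hand side up to floating-point error, so this particular storage vector admits a feasible assignment; the lemma asserts that such an assignment always exists.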
To prove the existence of a feasible solution for , we show that is always a positive linear combination of columns of . From the first statement of [56, Theorem 3.3], we note that if we can find a column of , for instance , such that for all that satisfy , we have ; then is a positive linear combination of the columns of . Note that, from the last property of , if we can find such a column, then we can find an that satisfy one of the following inequalities and vice versa:
[TABLE]
where
[TABLE]
First, we order the variables and , among themselves in the decreasing order and we define and , such that,
[TABLE]
Then, we have the following series of inequalities for all that satisfy :
[TABLE]
where in (73), we use Lemma 4 and insert the values of and , and in (74) we use the rearrangement inequality [58]. We have (75) by using the fact that is between and , where each is a real number between 0 and 1, and by redistributing the values where we maximize the ones that are the coefficients of the largest values. Next, we observe that, is the convex combination of and , which results in (76). Hence, we have,
[TABLE]
for all that satisfy . Finally, (77) shows that we can always find that satisfies either (68) or (69), concluding the proof.
7 Conclusions
We considered a PIR system where a data center places available content into heterogeneous sized databases, from which a user retrieves a file privately. We determined the exact PIR capacity (i.e., the minimum download cost) under arbitrary storage constraints. By showing the achievability of the solution of a relaxed problem where all available storage space is pooled into a sum storage space, by the original problem with individual storage constraints, we showed the equivalence of the heterogeneous PIR capacity to the corresponding homogeneous PIR capacity. Therefore, we showed that there is no loss in PIR capacity due to database storage size heterogeneity, so long as the placement phase is optimized.
Appendix A Alternative Proof for Lemma 5 for $N = 4$
Here, we give an alternative proof of Lemma 5 for using Farkas’ lemma. We illustrate the general idea using the example case . Using Lemma 4, we have and . We want to show the existence of and for all such that,
[TABLE]
This is a linear system with 10 unknowns and 6 equations in the form of $A\mathbf{x} = \mathbf{b}$, where $A$ is the coefficient matrix. To show the existence of a non-negative solution, we use Farkas' lemma, which states that there exists a non-negative solution $\mathbf{x} \geq 0$ that satisfies $A\mathbf{x} = \mathbf{b}$ if and only if for all $\mathbf{y}$ for which $A^T \mathbf{y} \geq 0$, we have $\mathbf{b}^T \mathbf{y} \geq 0$. We transform the system of equations into the reduced row echelon form with:
[TABLE]
with
[TABLE]
and
[TABLE]
Hence, for any , implies,
[TABLE]
Now, we need to show . We have the following for (the worst case):
[TABLE]
where (100) follows from (87)-(96) taking into consideration that and .
References
- [1] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan. Private information retrieval. Journal of the ACM, 45(6):965–981, November 1998.
- [2] N. B. Shah, K. V. Rashmi, and K. Ramchandran. One extra bit of download ensures perfectly private information retrieval. In IEEE ISIT, June 2014.
- [3] T. Chan, S. Ho, and H. Yamamoto. Private information retrieval for coded storage. In IEEE ISIT, June 2015.
- [4] A. Fazeli, A. Vardy, and E. Yaakobi. Codes for distributed PIR with low storage overhead. In IEEE ISIT, June 2015.
- [5] R. Tajeddine and S. El Rouayheb. Private information retrieval from MDS coded data in distributed storage systems. In IEEE ISIT, July 2016.
- [6] H. Sun and S. A. Jafar. Blind interference alignment for private information retrieval. In IEEE ISIT, July 2016.
- [7] H. Sun and S. A. Jafar. The capacity of private information retrieval. IEEE Trans. on Info. Theory, 63(7):4075–4088, July 2017.
- [8] H. Sun and S. A. Jafar. The capacity of robust private information retrieval with colluding databases. IEEE Trans. on Info. Theory, 64(4):2361–2370, April 2018.
