Private Information Retrieval in Graph Based Replication Systems

Netanel Raviv; Itzhak Tamo; Eitan Yaakobi

arXiv:1812.01566·cs.IT·March 6, 2019


TL;DR

This paper investigates private information retrieval protocols in graph-based storage systems, proposing a scheme that maximizes privacy against certain collusions and analyzing its efficiency and extensions.

Contribution

It introduces a 2-replication PIR scheme that guarantees privacy against acyclic collusions and provides bounds on its rate, extending to larger replication factors and coding.

Findings

1. Guarantees perfect privacy from acyclic sets
2. Achieves PIR rate within a factor of two of optimal for certain graphs
3. Extends results to larger replication factors and graph-based coding


Table I: Different examples for the choice of $G$ in Section III. The parameter $t$ stands for the guaranteed $t$-privacy of the system, and $d$ denotes the fixed degree of the vertices in the graph.

Graph	$n$	$s$	$t$	$d$	PIR rate
Petersen	$15$	$10$	$4$	$3$	$\frac{1}{10}$
Complete bipartite	Square	$2\sqrt{n}$	$3$	$\sqrt{n}$	$\frac{1}{2\sqrt{n}}$
Gen. polygons	$O(q^{3})$	$O(q^{2})$	$5$	$q+1$	$O(n^{-2/3})$
Gen. polygons	$O(q^{4})$	$O(q^{3})$	$7$	$q+1$	$O(n^{-3/4})$
Gen. polygons	$O(q^{6})$	$O(q^{5})$	$11$	$q+1$	$O(n^{-5/6})$
Murty	$p^{2m}(p^{m}+2)$	$2p^{2m}$	$4$	$p^{m}+2$	$O(n^{-2/3})$
Ramanujan	Any	$\frac{2n}{d}$	$O(\log n)$	Constant	$\frac{d}{2n}$

Equations

$$I(\{\textbf{q}_{j}\}_{j\in\mathcal{T}};\phi)=0,$$

$$Q\triangleq\operatorname{diag}(\boldsymbol{\gamma})\cdot I_{\phi}\cdot\operatorname{diag}(\boldsymbol{\alpha}),$$

$$\mathbbold{1}\cdot\operatorname{diag}(\boldsymbol{\gamma})^{-1}\operatorname{diag}(\boldsymbol{\gamma})I_{\phi}\operatorname{diag}(\boldsymbol{\alpha})X=\mathbbold{1}\cdot I_{\phi}\operatorname{diag}(\boldsymbol{\alpha})X=(h-1)\alpha_{\phi}\textbf{x}_{\phi},$$

$$
A=\begin{pmatrix}
* & & & & h\\
* & * & & & \\
 & * & \ddots & & \\
 & & \ddots & * & \\
 & & & * & -1
\end{pmatrix},
$$

$$\mathcal{T}=\mathcal{T}(\mathcal{S},\phi)\triangleq\Bigl(\bigcap_{k=1}^{\ell}E(C_{k})\Bigr)\setminus\bigcup_{k=1}^{\ell^{\prime}}E(C_{k}^{\prime}),$$

$$\operatorname{rank}(A^{C})=\begin{cases}|E(C)|&\mbox{if }\phi\in E(C)\\ |E(C)|-1&\mbox{if }\phi\notin E(C).\end{cases}$$

$$\min\ \mathbbold{1}_{s}\cdot\mu^{\top},\mbox{ subject to }I(G)^{\top}\mu^{\top}\geq\mathbbold{1}_{n}\mbox{ and }\mu\geq 0,$$

$$\max\ \mathbbold{1}_{n}\cdot\eta^{\top},\mbox{ subject to }I(G)\eta^{\top}\leq\mathbbold{1}_{s}\mbox{ and }\eta\geq 0.$$

$$(\textbf{q}_{j})_{t}=\begin{cases}\gamma_{j}\cdot\alpha_{t}\cdot h^{\delta(t,m)}&\mbox{if }j\mbox{ contains a codeword symbol of }\textbf{x}_{t}\\ 0&\mbox{else}\end{cases},$$

$$\sum_{j\in L_{1}}\gamma_{j}^{-1}a_{j}^{\top},\ \ldots,\ \sum_{j\in L_{N}}\gamma_{j}^{-1}a_{j}^{\top}$$

$$\frac{bf}{s\cdot(f/K)\cdot r}=b\cdot\frac{K}{sr}=\frac{r(N-K)}{K}\cdot\frac{K}{sr}=\frac{N-K}{s}.$$

$$\{1,5,9\},\ \{1,6,10\},\ \{1,7,11\},\ \{1,8,12\}$$

$$\{\{1,a,b\}\mid\{a,b\}\in M_{1}\}$$

$$a=\sum_{k\in E(C)\setminus\{j\}}m_{k}c_{k},$$


Full text


Abstract

In a Private Information Retrieval (PIR) protocol, a user can download a file from a database without revealing the identity of the file to each individual server. A PIR protocol is called $t$ -private if the identity of the file remains concealed even if $t$ of the servers collude. Graph based replication is a simple technique, which is prevalent in both theory and practice, for achieving erasure robustness in storage systems. In this technique each file is replicated on two or more storage servers, giving rise to a (hyper-)graph structure. In this paper we study private information retrieval protocols in graph based replication systems. The main interest of this work is maximizing the parameter $t$ , and in particular, understanding the structure of the colluding sets which emerge in a given graph. Our main contribution is a $2$ -replication scheme which guarantees perfect privacy from acyclic sets in the graph, and guarantees partial-privacy in the presence of cycles. Furthermore, by providing an upper bound, it is shown that the PIR rate of this scheme is at most a factor of two from its optimal value for an important family of graphs. Lastly, we extend our results to larger replication factors and to graph-based coding, which is a similar technique with smaller storage overhead and larger PIR rate.

I Introduction

Recent data breaches in major corporations have emphasized the need for privacy in the digital era. Among the many challenges that designers of distributed storage systems face is the ability to support private information retrieval (PIR) protocols. These protocols enable the end user to retrieve an entry of the database, while concealing the identity of that entry from the servers. This paper studies PIR protocols in a particular common type of distributed storage systems.

Coding for storage systems has developed tremendously in recent years. However, many system designers still favor replication techniques, over more involved ones, as a means to guarantee robustness against hardware failures [12, 5]. In spite of having high storage overhead and low failure resilience, replication is often preferred due to its simplicity of implementation. In addition, various types of replication systems are studied in theoretical research due to their real-world impact and ease of analysis [18, 29, 30, 9, 19]. However, since contemporary datasets are far too large to be stored on one machine, it is usually the case that every machine stores a small number of selected files from the dataset, each of which is replicated among geographically separated machines. In turn, such systems can be modeled as hypergraphs, where nodes represent storage servers and (hyper-)edges represent files. In these graphs, an edge is incident with a node if a copy of the respective file is stored on the respective server. Storage systems which broadly adhere to the above outline are called graph-based replication systems. A graph-based replication system in which every file is replicated $r$ times is called an $r$-replication system, and $r$ is called its replication factor.

One of the most important metrics by which PIR protocols are measured is their collusion resistance. In its most simplistic form, a PIR protocol must guarantee perfect privacy against every individual server (in some settings only computational privacy is required, but this paper focuses exclusively on perfect privacy). That is, it should be impossible for any individual server to infer any information regarding the identity of the requested file. The term collusion resistance measures the ability of a PIR protocol to perform beyond this baseline. That is, it asks for the maximum number of servers that remain completely oblivious to the identity of the file even if collusion among them is permitted. Traditionally, the term “collusion” stems from a mindset which considers the servers themselves as adversaries. Yet, the authors of this paper deem this interpretation obsolete, since it does not align with contemporary storage services. Instead, one can think of geographically separated servers as having independent security protocols that must be individually broken by an adversary. In this case, the term “colluding servers” refers to a set of servers whose security was breached by an outside adversary, who can therefore observe their input and output. Normally, the term $t$-privacy of a given protocol indicates the maximum number of servers that cannot infer any information regarding the identity of the file even if they collude; and in our alternative viewpoint, $t+1$ is the minimum number of individually-secured servers that must be breached by an adversary in order to infringe the perfect privacy of the protocol. Nevertheless, in our choice of terms we comply with the standard nomenclature.

PIR protocols have been studied extensively in the past years, and many additional metrics of interest were defined. Among these metrics are: (a) the PIR rate, which measures the ratio between the size of the desired data and the size of the downloaded data; (b) the upload complexity, which measures the size of the queries that are sent to the servers; and (c) the storage overhead, which measures the amount of redundancy in the system. While our main concern is understanding the collusion resistance of the system, we also address some of these metrics in our analysis.

In this paper we initiate a study about PIR protocols in graph based replication systems, and our primary focus is studying their collusion resistance. Since such systems are inherently non-uniform, in the sense that every server stores a different part of the dataset, one might expect that the collusion resistance will act accordingly. Indeed, our results show that the right viewpoint for analyzing colluding sets is not their size, but rather the structure of their induced subgraph. In particular, perfect privacy is maintained if the colluding sets do not contain certain sub-graphs.

Our results shed light on the design of such systems in a bilateral manner. On one hand, we provide recommendations for system designers regarding the file dispersion in the system. On the other hand, we provide a way for analyzing the collusion resistance of a given system. In particular, we provide a PIR protocol for $2$-replication systems and show that its PIR rate is at least half of its optimal value in many cases of interest. For larger replication factors we provide a simple scheme whose collusion resistance is less than the replication factor, and another scheme which obtains a larger collusion resistance by a reduction to the $2$-replication case.

Further, we suggest an alternative graph-based coding approach, in which every file is coded by using an MDS code, and the resulting codeword symbols are dispersed as in graph-based replication systems. While this approach reduces the storage overhead and increases the PIR rate, it requires a careful file dispersion in order to guarantee high collusion resistance. The results in this paper, and graph-based coding in particular, call for future research and practical implementations, that would hopefully bring the vast PIR literature closer to realistic storage systems.

This paper is structured as follows. Preliminaries and previous works are discussed in Section II. Protocols and bounds for $2$ -replication systems are given in Section III, and larger replication factors are discussed in Section IV. Then, graph-based coding is discussed in Section V, and open problems for future research are discussed in Section VI.

II Preliminaries

For a prime power $q$ let $\mathbb{F}_{q}$ be the field with $q$ elements. In a PIR protocol (not necessarily a graph-based one), a dataset $X=(\textbf{x}_{1}^{\top},\ldots,\textbf{x}_{n}^{\top})^{\top}\in\mathbb{F}_{q}^{n\times f}$, which consists of $n$ files $\{\textbf{x}_{i}\}_{i=1}^{n}$, is stored across $s$ storage servers in a possibly coded manner. The user wishes to download the file $\textbf{x}_{\phi}$, where for the sake of the probabilistic analysis, $\phi$ is seen as uniformly distributed over $[n]\triangleq\{1,2,\ldots,n\}$. To this end, the user uses randomness in order to generate queries $\textbf{q}_{1},\ldots,\textbf{q}_{s}$, one for every server. In turn, server $i$ replies with $\textbf{a}_{i}$, which is a deterministic function of $\textbf{q}_{i}$ and the server’s content. The protocol is called $t$-private if for every subset $\mathcal{T}\subseteq[s]$ of size at most $t$,

$$I(\{\textbf{q}_{j}\}_{j\in\mathcal{T}};\phi)=0,$$

where $I$ denotes mutual information. Alternatively, the protocol is $t$ -private if $\{\textbf{q}_{j}\}_{j\in\mathcal{T}}$ and $\phi$ are independent. Finally, the PIR rate of the system is $f/\sum_{i\in[s]}|\textbf{a}_{i}|$ , i.e., the ratio between the size of the desired data and the amount of downloaded one, both measured in $\mathbb{F}_{q}$ symbols.

In a graph-based replication system every file is replicated multiple times and each one of the copies is stored on a different server. If all files are replicated an identical number of times $r$, we say that it is an $r$-replication system, and $r$ is its replication factor. In a 2-replication system a graph structure arises, in which nodes represent servers, edges represent files, and an edge is incident with a node if the respective file is stored on the respective server. Similarly, in $r$-replication systems for $r>2$ an $r$-uniform hypergraph structure arises (that is, a hypergraph in which all edges contain an identical number of nodes), and in systems where every file is replicated a different number of times, a non-uniform hypergraph arises. Notice that for $r=2$, a multigraph might arise in cases where there exist two servers that share more than one file in common (a multigraph is a graph in which a certain edge can appear multiple times; multiple occurrences of the same edge are called parallel edges). While our analysis does not exclude these cases, they result in poor collusion resistance and impede the overall message. Therefore, we restrict our attention to systems in which every two servers store at most one file in common (see Remark 7 for further discussion).

Graphs are denoted by $G=(V,E)$, where $V=\{v_{1},v_{2},\ldots\}$ and $E=\{e_{1},e_{2},\ldots\}$. Unless otherwise stated, all graphs in this paper are undirected, and hence, an edge is a subset of vertices (a subset of size two in ordinary graphs, and of arbitrary size in hypergraphs). For a given graph $G^{\prime}$ we denote its set of edges by $E(G^{\prime})$ and its set of vertices by $V(G^{\prime})$. Since graphs represent storage systems in this paper, the terms node, vertex, and server are used interchangeably, and so are the terms edge and file.

For a graph $G$ and a subset $\mathcal{S}\subseteq V(G)$ we denote by $G_{\mathcal{S}}$ the subgraph induced by $\mathcal{S}$, i.e., the graph which consists of the nodes in $\mathcal{S}$ and all the edges in $E(G)$ both of whose incident nodes are in $\mathcal{S}$. A cycle in $G$ is a subgraph of $G$ whose nodes are $\{v_{i}\}_{i=0}^{t-1}$ for some $t$, and whose edges are $\{v_{i},v_{i+1\bmod t}\}_{i=0}^{t-1}$, and these edges exist also in $E(G)$. An edge $e$ is said to be incident with a vertex $v$, and vice versa, if $v\in e$. The set of edges in $E(G)$ that are incident with $v$ is denoted by $\Gamma_{G}(v)$, where $G$ is omitted if clear from context. The incidence matrix $I(G)$ of a graph $G$ is a $|V(G)|\times|E(G)|$ binary matrix in which rows correspond to nodes and columns correspond to edges, and an entry contains $1$ if and only if the respective vertex is incident with the respective edge. In the sequel, the well-known Breadth First Search (BFS) algorithm is used repeatedly, in graphs as well as in hypergraphs, and the unfamiliar reader is referred to [7].
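As a concrete illustration of this notation, the following Python sketch builds the incidence matrix $I(G)$ of a small graph and tests whether an induced subgraph $G_{\mathcal{S}}$ contains a cycle. The helper names (`incidence_matrix`, `induced_edges`, `is_acyclic`), the toy graph, and the 0-based indexing of servers and files are assumptions made here for illustration and do not appear in the paper. The cycle test uses a union-find structure rather than BFS, purely for brevity.

```python
def incidence_matrix(s, edges):
    """Return I(G) as an s x n list of lists; rows are servers, columns are files."""
    I = [[0] * len(edges) for _ in range(s)]
    for e, (u, v) in enumerate(edges):
        I[u][e] = 1
        I[v][e] = 1
    return I


def induced_edges(edges, S):
    """Indices of the edges of G_S, i.e. edges with both endpoints in S."""
    S = set(S)
    return [e for e, (u, v) in enumerate(edges) if u in S and v in S]


def is_acyclic(edges, S):
    """True if the induced subgraph G_S contains no cycle (union-find test)."""
    parent = {v: v for v in S}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for e in induced_edges(edges, S):
        u, v = edges[e]
        ru, rv = find(u), find(v)
        if ru == rv:  # u and v are already connected, so this edge closes a cycle
            return False
        parent[ru] = rv
    return True


# Hypothetical toy system: a triangle on servers 0, 1, 2 plus a pendant server 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
I = incidence_matrix(4, edges)
```

In this toy system, the servers $\{0,1,3\}$ induce a single edge and are therefore acyclic, whereas $\{0,1,2\}$ induce the triangle.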

In all subsequent protocols, the queries $\textbf{q}_{1},\ldots,\textbf{q}_{s}$ are vectors in $\mathbb{F}_{q}^{n}$ , i.e., they contain a field element for every file. However, since the servers contain only a portion of the files in the system, the user communicates only their support to the servers. We denote by $Q$ the $s\times n$ matrix whose $i$ ’th row is $\textbf{q}_{i}$ for every $i\in[s]$ , and note that it is a random variable that depends on $\phi$ , and on the randomness at the user. In cases where $\phi$ is fixed, we denote the matrix of queries by $Q|\phi$ .

Since submatrices are used repeatedly, we define the following notation. For a matrix $A\in\mathbb{F}^{s\times n}$ and sets $\mathcal{S}\subseteq[s]$ and $\mathcal{N}\subseteq[n]$ , let $A_{\mathcal{S},\mathcal{N}}$ be the submatrix of $A$ that consists of the rows in $\mathcal{S}$ and the columns in $\mathcal{N}$ . Further, let $A_{:,\mathcal{N}}\triangleq A_{[s],\mathcal{N}}$ and $A_{\mathcal{S},:}\triangleq A_{\mathcal{S},[n]}$ . For vectors $\textbf{a}\in\mathbb{F}_{q}^{n}$ and $\textbf{b}\in\mathbb{F}_{q}^{s}$ we define $\textbf{a}_{\mathcal{N}}$ and $\textbf{b}_{\mathcal{S}}$ analogously. For convenience, we consider the rows and columns of a matrix $A_{\mathcal{S},\mathcal{N}}$ as indexed by $\mathcal{S}$ and $\mathcal{N}$ , respectively, rather than by $[|\mathcal{S}|]$ and $[|\mathcal{N}|]$ . For example, if $n=s=4$ and $\mathcal{S}=\mathcal{N}=\{2,3\}$ , then $A_{\mathcal{S},\mathcal{N}}$ is a $2\times 2$ matrix whose entries are indexed by $(2,2),(2,3),(3,2),(3,3)$ . Since submatrices of $Q$ are in strong correspondence with subgraphs of $G$ , for every subgraph $T$ of $G$ (denoted $T\subseteq G$ ) we denote $Q^{T}\triangleq Q_{V(T),E(T)}$ , and similarly, for every vector $\textbf{v}\in\mathbb{F}_{q}^{s}$ we define $\textbf{v}^{T}\triangleq\textbf{v}_{V(T)}$ .

By and large, we use lower-case letters ( $a,b,c,\ldots$ ) to denote scalars, boldface letters ( $\textbf{a},\textbf{b},\textbf{c},\ldots$ ) to denote vectors (all of which are row vectors), capital letters ( $A,B,C,\ldots$ ) to denote matrices or graphs, and calligraphic letters ( $\mathcal{A},\mathcal{B},\mathcal{C},\ldots$ ) to denote sets. Finally, we use the standard notation $[N,K]_{q}$ to denote a linear code of length $N$ and dimension $K$ over $\mathbb{F}_{q}$ .

II-A Previous work

Originally defined in [6], the PIR problem has attracted a tremendous amount of research in the past two decades; and due to its tight connection with distributed storage, PIR has enjoyed increasing attention in the past few years. Since a comprehensive summary of previous works is beyond the scope of this paper, we give herein only a partial list of recent contributions, and elaborate on the most relevant ones.

The recent surge of interest in PIR, which addresses the problem from a distributed storage standpoint, includes the reduction of storage overhead by using error correcting codes in [10] and its improvement in [3]; obtaining secrecy by one extra bit in [17] and its improvement in [4]; and an extensive line of works regarding achievability and capacity in various scenarios, such as multi-round, multi-message, symmetric, and with byzantine or colluding servers [20, 21, 23, 1, 26, 2, 22]. This line of works is a natural extension of an earlier one in the computer science community, which addressed the problem in a more simplistic fashion. Namely, the dataset is assumed to be replicated in its entirety on all servers in the system, and the files are assumed to consist of a single bit. Furthermore, this problem is strongly connected to locally decodable codes [27, 28], and has seen a substantial progress recently [8].

All of the aforementioned works fall into either one of two extremes in the approach towards PIR. In one, the dataset in its entirety is stored in every server, and in the other it is coded by using an MDS code. The current work addresses a sweet spot between the two, that is strongly motivated by real-world applications [12, 5], as well as a plethora of storage models that were addressed in the past [18, 9, 29, 30, 19].

Nevertheless, two notions that are relevant to this work were recently addressed in the literature. First, one may consider the special case of graph-based replication in which the degree of the nodes in the graph (i.e., the number of edges that are incident with a node) is upper bounded by some parameter. Evidently, this special case is strongly connected to a recent work [25], that addressed the general coded PIR question in cases where each server is constrained to contain only a fraction of the entire dataset. Yet, [25] did not impose the particular replication structure that is fundamental to our approach, and more importantly, did not consider collusion. Furthermore, we emphasize that our graph-based approach is highly flexible, in the sense that no constraint is imposed other than every file being replicated on a subset of the servers.

Another notion that was previously studied is that of collusion patterns [24, 13]. In this variant, the system must guarantee collusion resistance against specific subsets of servers, rather than any subset up to a certain size. This notion bears some similarity to this work, since one may compel the vertices in these specific sets not to induce a subgraph which infringes privacy in our scheme. However, the approach and the results of these works are substantially different from ours, e.g., since [24] only discusses coded storage, and [13] discusses replication of the entire dataset in every server, and disjoint colluding sets.

III Replication factor two

III-A A PIR protocol for 2-replication systems

In this section it is assumed that the replication factor is two, and that every two servers store at most one file in common (see Remark 7), which results in a graph $G=(V,E)$ . The scheme applies for any field $\mathbb{F}_{q}$ with at least three elements. Upon requiring file $\textbf{x}_{\phi}$ , the user randomly chooses a vector $\boldsymbol{\alpha}=(\alpha_{i})_{i=1}^{n}\in(\mathbb{F}_{q}^{*})^{n}$ , a vector $\boldsymbol{\gamma}=(\gamma_{i})_{i=1}^{s}\in(\mathbb{F}_{q}^{*})^{s}$ , and an element $h\in\mathbb{F}_{q}\setminus\{0,1\}$ , all uniformly at random, and defines

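$$Q\triangleq\operatorname{diag}(\boldsymbol{\gamma})\cdot I_{\phi}\cdot\operatorname{diag}(\boldsymbol{\alpha}),$$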

where $I_{\phi}$ is obtained from $I(G)$ by replacing the lower $1$ -entry in each column with $-1$ , and then replacing the $1$ -entry in column $\phi$ by $h$ .

Let $\textbf{q}_{j}$, the query for server $j$, be the $j$-th row of $Q$. Clearly, to upload this row we only need to send the values of its nonzero entries, and hence the total upload complexity is $2n$. Each node responds with $\textbf{a}_{j}=\textbf{q}_{j}\cdot X$, and therefore the download complexity is $sf$, and the PIR rate is $1/s$. Note that node $j$ can calculate the inner product since the support of $\textbf{q}_{j}$ contains only the indices of the files available to it. Upon receiving the information from all $s$ servers, the user has access to $QX=\operatorname{diag}(\boldsymbol{\gamma})I_{\phi}\operatorname{diag}(\boldsymbol{\alpha})X$. Then, by multiplying from the left by the matrix $\operatorname{diag}(\boldsymbol{\gamma})^{-1}$ and by the all ones vector $\mathbbold{1}$, the user gets

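$$\mathbbold{1}\cdot\operatorname{diag}(\boldsymbol{\gamma})^{-1}\operatorname{diag}(\boldsymbol{\gamma})I_{\phi}\operatorname{diag}(\boldsymbol{\alpha})X=\mathbbold{1}\cdot I_{\phi}\operatorname{diag}(\boldsymbol{\alpha})X=(h-1)\alpha_{\phi}\textbf{x}_{\phi},$$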

and hence $\textbf{x}_{\phi}$ can be recovered. We proceed with studying the collusion resistance of the suggested scheme. The following claim is a special case of a more general one that is given in the sequel (Theorem 4). Nevertheless, it is given here in its current form to maintain simplicity and flow, and its proof is sketched.
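Before the privacy analysis, the retrieval procedure of this subsection (query generation, server responses, and decoding) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field is restricted to a prime field $\mathbb{F}_{p}$ with $p\geq 3$ so that plain modular arithmetic suffices, servers and files are indexed from 0, and the function names (`make_queries`, `server_answer`, `retrieve`, `petersen_edges`) are hypothetical.

```python
import random


def make_queries(edges, s, phi, p, rng):
    """Sample alpha, gamma, h and return Q = diag(gamma) * I_phi * diag(alpha) over F_p."""
    n = len(edges)
    alpha = [rng.randrange(1, p) for _ in range(n)]  # nonzero, one per file
    gamma = [rng.randrange(1, p) for _ in range(s)]  # nonzero, one per server
    h = rng.randrange(2, p)                          # h is neither 0 nor 1
    Q = [[0] * n for _ in range(s)]
    for e, (u, v) in enumerate(edges):
        lo, hi = min(u, v), max(u, v)
        upper = h if e == phi else 1                 # upper entry of column e in I_phi
        Q[lo][e] = gamma[lo] * alpha[e] * upper % p
        Q[hi][e] = gamma[hi] * alpha[e] * (p - 1) % p  # lower entry of I_phi is -1
    return Q, alpha, gamma, h


def server_answer(q_row, X, p):
    """a_j = q_j * X; q_j is supported only on the files that server j stores."""
    return [sum(q * x[k] for q, x in zip(q_row, X)) % p for k in range(len(X[0]))]


def retrieve(edges, s, X, phi, p, rng):
    """Run the protocol once and return the decoded file x_phi."""
    Q, alpha, gamma, h = make_queries(edges, s, phi, p, rng)
    answers = [server_answer(Q[j], X, p) for j in range(s)]
    inv = lambda a: pow(a, p - 2, p)                 # inverse in F_p, valid for prime p
    # 1 * diag(gamma)^{-1} * (Q X) = (h - 1) * alpha_phi * x_phi
    total = [sum(inv(gamma[j]) * answers[j][k] for j in range(s)) % p
             for k in range(len(X[0]))]
    scale = inv((h - 1) * alpha[phi] % p)
    return [scale * t % p for t in total]


def petersen_edges():
    """The 15 edges of the Petersen graph on nodes 0..9."""
    outer = [(i, (i + 1) % 5) for i in range(5)]
    spokes = [(i, i + 5) for i in range(5)]
    inner = [(5 + i, 5 + (i + 2) % 5) for i in range(5)]
    return outer + spokes + inner


# Usage: 15 files of length f = 4 over F_7, stored on 10 servers.
rng = random.Random(0)
edges = petersen_edges()
X = [[rng.randrange(7) for _ in range(4)] for _ in edges]
assert retrieve(edges, 10, X, phi=6, p=7, rng=rng) == X[6]
```

Each server returns $f$ field elements, so the user downloads $sf$ symbols to obtain $f$ symbols, which matches the PIR rate $1/s$ stated above.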

Proposition 1.

For any set of servers $\mathcal{S}\subseteq V$ such that $G_{\mathcal{S}}$ does not contain a cycle, we have that $I(\{\textbf{q}_{i}\}_{i\in\mathcal{S}};\phi)=0$ .

Proof sketch.

To prove the claim, we analyze the submatrix of queries that is seen by $\mathcal{S}$ . For clarity, we omit zero columns from this matrix, as well as columns of weight one, since the latter ones are obviously purely random, and cannot cause leakage of information. Hence, the matrix we analyze is chosen according to the random variable $Q^{G_{\mathcal{S}}}$ .

It is evident that every matrix which is chosen according to $Q^{G_{\mathcal{S}}}$ has support which is identical to that of $I(G)^{G_{\mathcal{S}}}$ . In what follows we explain why every $|V(G_{\mathcal{S}})|\times|E(G_{\mathcal{S}})|$ matrix $M$ whose support is identical to that of $I(G)^{G_{\mathcal{S}}}$ can be obtained by some choice of $\boldsymbol{\gamma},\boldsymbol{\alpha}$ , and $h$ with identical probability, regardless of the value of $\phi$ . Consequently, this proves that no information regarding $\phi$ is leaked.

We calculate $\Pr(Q^{G_{\mathcal{S}}}=M)$ by an iterative process that follows a Breadth First Search (BFS) traversal on $G_{\mathcal{S}}$. Pick an arbitrary $v_{i}\in\mathcal{S}$, and fix the value of the corresponding $\gamma_{i}$ (with probability one). Clearly, it follows that $\Pr(\gamma_{i}\cdot\alpha_{j}\cdot(I_{\phi})_{i,j}=M_{i,j})=(q-1)^{-1}$ for every $e_{j}\in\Gamma_{G_{\mathcal{S}}}(v_{i})$ regardless of whether or not $(I_{\phi})_{i,j}$ is the entry of $I_{\phi}$ which is multiplied by $h$. Having the values of $\alpha_{j}$ for every $e_{j}\in\Gamma_{G_{\mathcal{S}}}(v_{i})$ fixed, we have that $\Pr(\gamma_{j^{\prime}}\cdot\alpha_{j}\cdot(I_{\phi})_{j^{\prime},j}=M_{j^{\prime},j})=(q-1)^{-1}$ for the same reasons, where $v_{j^{\prime}}$ is the other end of edge $e_{j}$ (again, regardless of whether or not $(I_{\phi})_{j^{\prime},j}$ is the entry of $I_{\phi}$ which is multiplied by $h$). In other words, we have that fixing an entry in $\boldsymbol{\gamma}$ which corresponds to some $v\in V(G_{\mathcal{S}})$ compels us to fix the values in $\boldsymbol{\alpha}$ which correspond to all of $\Gamma_{G_{\mathcal{S}}}(v)$. In turn, fixing these entries of $\boldsymbol{\alpha}$ compels us to fix the values of $\boldsymbol{\gamma}$ at the other endpoints of the edges in $\Gamma_{G_{\mathcal{S}}}(v)$. Since $G_{\mathcal{S}}$ does not contain a cycle, we may proceed in a BFS fashion and have that every edge-node incidence in $G_{\mathcal{S}}$ reduces the overall probability of obtaining $M$ by a factor of $(q-1)^{-1}$. Hence, every such matrix $M$ is obtained with probability $(q-1)^{-|M|}$, where $|M|$ is the size of the support of $M$, and regardless of the value of $\phi$. Hence, perfect privacy is guaranteed. ∎
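The claim of Proposition 1 can be checked by brute force on a toy instance. The sketch below enumerates all choices of $\boldsymbol{\alpha},\boldsymbol{\gamma},h$ and tabulates the exact distribution of the query entries seen by a set $\mathcal{S}$, for each value of $\phi$. The toy graph (a triangle with a pendant edge), the choice $p=3$, the 0-based indexing, and the names `view` and `view_distribution` are assumptions made here for illustration. It is a sanity check on one small example and not a substitute for the proof.

```python
from collections import Counter
from itertools import product


def view(edges, phi, S, p, alpha, gamma, h):
    """Nonzero entries of Q = diag(gamma) * I_phi * diag(alpha) in the rows of S."""
    seen = []
    for e, (u, v) in enumerate(edges):
        lo, hi = min(u, v), max(u, v)
        upper = h if e == phi else 1
        if lo in S:
            seen.append((lo, e, gamma[lo] * alpha[e] * upper % p))
        if hi in S:
            seen.append((hi, e, gamma[hi] * alpha[e] * (p - 1) % p))
    return tuple(seen)


def view_distribution(edges, s, phi, S, p):
    """Exact distribution of the colluders' view, by enumerating all randomness."""
    counts = Counter()
    for alpha in product(range(1, p), repeat=len(edges)):
        for gamma in product(range(1, p), repeat=s):
            for h in range(2, p):
                counts[view(edges, phi, S, p, alpha, gamma, h)] += 1
    return counts


# Toy system: triangle on servers 0, 1, 2 (files 0, 1, 2) plus pendant file 3 on servers 2, 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
p = 3
acyclic = [view_distribution(edges, 4, phi, {0, 1, 3}, p) for phi in range(4)]
cyclic = [view_distribution(edges, 4, phi, {0, 1, 2}, p) for phi in range(4)]

# Acyclic colluding set: the view has the same distribution for every phi.
assert all(d == acyclic[0] for d in acyclic)
# Colluding set containing the triangle: the view separates phi = 3 from phi in {0, 1, 2}.
assert cyclic[0] != cyclic[3]
```

The enumeration has $(p-1)^{n+s}(p-2)$ terms per value of $\phi$, so it is only practical for very small $p$, $n$, and $s$.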

We now turn to study how gracefully the perfect privacy deteriorates if $\mathcal{S}$ contains one or more cycles, i.e., how much of $\phi$ ’s identity is revealed.

Proposition 2.

For any cycle $C=(V^{\prime},E^{\prime})$ in $G$ , any matrix $M$ in the support of the random variable $Q^{C}$ is invertible if and only if $e_{\phi}\in E^{\prime}$ .

Proof.

Let $A\triangleq\operatorname{diag}(\boldsymbol{\gamma}_{V^{\prime}})^{-1}M\operatorname{diag}(\boldsymbol{\alpha}_{E^{\prime}})^{-1}$, and observe that $\operatorname{rank}(A)=\operatorname{rank}(M)$. If $\phi\notin E^{\prime}$, then each column of $A$ has two nonzero entries $1$ and $-1$. Hence, $\mathbbold{1}$ is in its left kernel, and thus $\operatorname{rank}(A)<c$, where $c\triangleq|V^{\prime}|=|E^{\prime}|$. Moreover, it is an easy exercise to show that any set of $c-1$ columns of $A$ is linearly independent, and hence $\operatorname{rank}(A)=c-1$.

On the other hand if $\phi\in E^{\prime}$ , assume without loss of generality that $A$ is of the form

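$$
A=\begin{pmatrix}
* & & & & h\\
* & * & & & \\
 & * & \ddots & & \\
 & & \ddots & * & \\
 & & & * & -1
\end{pmatrix},
$$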

where $*$ denotes a nonzero entry. Then, $\det A=(-1)^{c-1}h\cdot\det A_{1}-\det A_{2}$, where $A_{1}$ (resp. $A_{2}$) is the bottom-left (resp. top-left) $(c-1)\times(c-1)$ submatrix of $A$. Notice that $\det A_{1}$ is the product of all $*$-entries in the sub-diagonal of $A$, and that $\det A_{2}$ is the product of all $*$-entries in the main diagonal of $A$. Hence, since every pair of $*$-entries in any given column are negations of one another, it follows that $\det A_{1}=(-1)^{c-1}\det A_{2}$. Thus, $\det A=(-1)^{2c-2}h\cdot\det A_{2}-\det A_{2}=(h-1)\det A_{2}\neq 0$. ∎
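Proposition 2 lends itself to a quick numerical check. The sketch below samples queries for a graph consisting of a $c$-cycle plus one pendant file outside the cycle, extracts $Q^{C}$, and computes its rank over a prime field by Gaussian elimination. The graph, the restriction to prime $p$, and the names `rank_mod_p` and `cycle_view` are illustrative assumptions rather than anything taken from the paper.

```python
import random


def rank_mod_p(M, p):
    """Rank of an integer matrix over F_p (p prime), by Gauss-Jordan elimination."""
    M = [[x % p for x in row] for row in M]
    rank, rows, cols = 0, len(M), len(M[0])
    for c in range(cols):
        piv = next((r for r in range(rank, rows) if M[r][c]), None)
        if piv is None:
            continue
        M[rank], M[piv] = M[piv], M[rank]
        inv = pow(M[rank][c], p - 2, p)              # inverse of the pivot in F_p
        M[rank] = [x * inv % p for x in M[rank]]
        for r in range(rows):
            if r != rank and M[r][c]:
                factor = M[r][c]
                M[r] = [(x - factor * y) % p for x, y in zip(M[r], M[rank])]
        rank += 1
    return rank


def cycle_view(c, phi, p, rng):
    """Sample Q^C for the cycle C on servers 0..c-1; file c is a pendant file outside C."""
    edges = [(i, (i + 1) % c) for i in range(c)] + [(0, c)]
    alpha = [rng.randrange(1, p) for _ in edges]
    gamma = [rng.randrange(1, p) for _ in range(c + 1)]
    h = rng.randrange(2, p)
    Q = [[0] * len(edges) for _ in range(c + 1)]
    for e, (u, v) in enumerate(edges):
        lo, hi = min(u, v), max(u, v)
        Q[lo][e] = gamma[lo] * alpha[e] * (h if e == phi else 1) % p
        Q[hi][e] = gamma[hi] * alpha[e] * (p - 1) % p
    return [row[:c] for row in Q[:c]]                # rows V(C), columns E(C)


rng = random.Random(0)
# phi on the cycle: Q^C is invertible; phi off the cycle: the rank drops by one.
assert rank_mod_p(cycle_view(5, 2, 7, rng), 7) == 5
assert rank_mod_p(cycle_view(5, 5, 7, rng), 7) == 4
```

This is exactly the test available to colluding servers whose induced subgraph contains the cycle: they see $Q^{C}$ and can compute its rank.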

Corollary 3.

A set $\mathcal{S}\subseteq V$ such that $G_{\mathcal{S}}$ contains cycles can narrow down the possible values of $e_{\phi}$ (and hence, of $\phi$ itself) to

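$$\mathcal{T}=\mathcal{T}(\mathcal{S},\phi)\triangleq\Bigl(\bigcap_{k=1}^{\ell}E(C_{k})\Bigr)\setminus\bigcup_{k=1}^{\ell^{\prime}}E(C_{k}^{\prime}),$$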

where $C_{1},\ldots,C_{\ell}$ are all cycles in $G_{\mathcal{S}}$ that contain $e_{\phi}$ (for $\ell=0$ we formally define $\bigcap_{k=1}^{\ell}E(C_{k})=E$), and $C_{1}^{\prime},\ldots,C_{\ell^{\prime}}^{\prime}$ are all cycles in $G_{\mathcal{S}}$ that do not contain $e_{\phi}$.

Proof.

Let $M$ be the matrix that is seen by $\mathcal{S}$ ; chosen according to the random variable $Q^{G_{\mathcal{S}}}$ . By Proposition 2, the colluding servers can compute the rank of $M^{C}$ for every cycle $C$ in their induced subgraph, and deduce if $e_{\phi}\in E(C)$ accordingly. ∎

We now show that Corollary 3 is in some sense the best that the colluding servers can hope for. Formally, we show that conditioned on $e_{\phi}\in\mathcal{T}$, all respective possible queries are obtained with identical probability. The immediate conclusion is that out of the $\log n$ protected bits of $\phi$, the information leakage when a set $\mathcal{S}$ colludes is precisely $\log n-\log|\mathcal{T}|$; or, differently put, all files in $\mathcal{T}$ are equally likely.
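As a worked illustration (an example constructed here, not taken from the paper), let $G$ be the Petersen graph, so that $n=15$, and let $\mathcal{S}$ be the five nodes of one of its $5$-cycles $C$. Since the Petersen graph has girth $5$, the induced subgraph $G_{\mathcal{S}}$ is exactly $C$ and contains a single cycle. Assuming base-2 logarithms, the leakage formula gives:

$$
\mathcal{T}(\mathcal{S},\phi)=\begin{cases}E(C)&\mbox{if }e_{\phi}\in E(C)\\ E\setminus E(C)&\mbox{if }e_{\phi}\notin E(C)\end{cases}
\qquad\Longrightarrow\qquad
\log n-\log|\mathcal{T}|=\begin{cases}\log 15-\log 5=\log 3\approx 1.585&\mbox{if }e_{\phi}\in E(C)\\ \log 15-\log 10=\log\tfrac{3}{2}\approx 0.585&\mbox{if }e_{\phi}\notin E(C).\end{cases}
$$

In both cases the colluding servers learn only whether the requested file lies on $C$, and the files within $\mathcal{T}$ remain equally likely.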

To state the main theorem of this paper, whose proof is given in Appendix A, and of which Proposition 1 is a special case, we require the following definition. For $\mathcal{S}\subseteq V$ and $\mathcal{D}\subseteq E$ , we say that a matrix in $\mathbb{F}_{q}^{|\mathcal{S}|\times|\mathcal{D}|}$ is $(\mathcal{S},\mathcal{D})$ -compatible with $G$ ( $(\mathcal{S},\mathcal{D})$ -compatible, for short) if its support coincides with that of $I(G)_{\mathcal{S},\mathcal{D}}$ . This definition extends naturally to a subgraph $T\subseteq G$ where a matrix in $\mathbb{F}_{q}^{|V(T)|\times|E(T)|}$ is said to be $T$ -compatible if it is $(V(T),E(T))$ -compatible.

Theorem 4.

For every subgraph $T\subseteq G$ , the support of the random variable $Q^{T}|\phi$ is the set of all matrices $A\in\mathbb{F}_{q}^{|V(T)|\times|E(T)|}$ such that:

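(a) $A$ is $T$-compatible with $G$; and

(b) for every cycle $C\subseteq T$,

$$\operatorname{rank}(A^{C})=\begin{cases}|E(C)|&\mbox{if }\phi\in E(C)\\ |E(C)|-1&\mbox{if }\phi\notin E(C).\end{cases}$$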

Furthermore, the random variable $Q^{T}|\phi$ is uniformly distributed on its support.

First, it is evident that the case where $T$ is acyclic in Theorem 4 proves Proposition 1. Second, we have the following corollary.

Corollary 5.

For every set $\mathcal{S}\subseteq V$ and every two distinct values $\phi_{1},\phi_{2}\in[n]$ such that $\phi_{2}\in\mathcal{T}(\mathcal{S},\phi_{1})$ , the servers in $\mathcal{S}$ cannot infer if $\phi=\phi_{1}$ or $\phi=\phi_{2}$ .

Proof.

Clearly, it suffices to prove that the random variables $Q^{G_{\mathcal{S}}}|(\phi=\phi_{1})$ and $Q^{G_{\mathcal{S}}}|(\phi=\phi_{2})$ are identical, i.e., the same queries are obtained with identical probabilities. Since both random variables are uniformly distributed on their support by Theorem 4, it suffices to prove that their supports are identical. Also by Theorem 4, it suffices to prove that the conditions (a) and (b) coincide in both cases. For (a) this claim is clear since it does not depend on the value of $\phi$ . For condition (b), we need to prove that $\phi_{1}\in E(C)$ if and only if $\phi_{2}\in E(C)$ for every cycle $C$ in $G_{\mathcal{S}}$ , which is precisely the meaning of $\phi_{2}\in\mathcal{T}(\mathcal{S},\phi_{1})$ . ∎
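To make the statement concrete, the following sketch computes $\mathcal{T}(\mathcal{S},\phi)$ by brute force on a toy graph, reading it (as in the proof above) as the set of edges that lie on exactly the same cycles of $G_{\mathcal{S}}$ as $e_{\phi}$. The toy graph, the function names, and the exhaustive cycle enumeration are illustrative assumptions and are only suitable for very small simple graphs.

```python
from math import log2

def simple_cycles(vertices, edges):
    """Yield every simple cycle of an undirected simple graph once, as a frozenset of edges."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()

    def extend(path, on_path):
        start, last = path[0], path[-1]
        for nxt in adj[last]:
            if nxt == start and len(path) >= 3:
                cycle = frozenset(frozenset(e) for e in zip(path, path[1:] + [start]))
                if cycle not in seen:  # each cycle is reached in both directions
                    seen.add(cycle)
                    yield cycle
            elif nxt not in on_path and nxt > start:
                # only vertices larger than the start are visited,
                # so each cycle is rooted at its smallest vertex
                yield from extend(path + [nxt], on_path | {nxt})

    for v in sorted(adj):
        yield from extend([v], {v})

def indistinguishable_set(all_edges, colluders, phi_edge):
    """Edges that lie on exactly the same cycles of the induced subgraph as phi_edge."""
    S = set(colluders)
    induced = [e for e in all_edges if set(e) <= S]
    cycles = list(simple_cycles(S, induced))

    def signature(e):
        return tuple(frozenset(e) in c for c in cycles)

    target = signature(phi_edge)
    return [e for e in all_edges if signature(e) == target]

# Hypothetical toy graph: a triangle 0-1-2 with a tail 2-3-4, so n = 5 files.
EDGES = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]

def leakage_bits(colluders, phi_edge):
    """log n - log |T| for the toy graph above."""
    t = indistinguishable_set(EDGES, colluders, phi_edge)
    return log2(len(EDGES)) - log2(len(t))
```

In this toy graph two colluding servers see no cycle and learn nothing, whereas the three servers of the triangle can tell whether the requested file is one of the three triangle edges.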

We now turn to present several choices of the graph $G$ , and the resulting privacy of the PIR schemes. These examples are summarized in Table I.

Example 6.

1. Taking $G$ to be the Petersen graph (a $3$-regular graph with $10$ nodes, $15$ edges, and girth $5$) allows storing $15$ files on $10$ servers, $3$ files on each, where any $4$ servers cannot infer any information regarding $\phi$. According to the structure of the Petersen graph, at least $8$ servers are required to infer the exact identity of $\phi$. The upload complexity is $30$ field elements, and the download complexity is $10f$ field elements, i.e., the PIR rate is $0.1$.

2. Taking $G=(\mathcal{L}\cup\mathcal{R},\mathcal{V})$ to be the complete bipartite graph, with $n$ a square integer and $|\mathcal{L}|=|\mathcal{R}|=\sqrt{n}$, allows storing $n$ files on $2\sqrt{n}$ servers. To retrieve a file $\textbf{x}_{\phi}$, the user downloads $2\sqrt{n}\cdot f$ field elements. The resulting system ensures perfect privacy against all sets $\mathcal{S}\subseteq\mathcal{L}\cup\mathcal{R}$ such that either $|\mathcal{S}\cap\mathcal{L}|\leq 1$ or $|\mathcal{S}\cap\mathcal{R}|\leq 1$, and in particular, all sets of size three.

3. Graphs of large (constant) girth $g$ are particularly useful since all sets with at most $g-1$ nodes are cycle-free, and hence the resulting protocol is $(g-1)$-private. These can be obtained as incidence graphs of generalized polygons [18, Table I], of which Item 2 above is a special case. In particular, for prime power $q$, there exist explicit graphs with degree $q+1$ with $s\in\{O(q^{2}),O(q^{3}),O(q^{5})\}$ (and hence $n\in\{O(q^{3}),O(q^{4}),O(q^{6})\}$), where $g\in\{6,8,12\}$, respectively. The respective download complexities are $O(n^{2/3})\cdot f$, $O(n^{3/4})\cdot f$, and $O(n^{5/6})\cdot f$.

4. Let $p\geq 5$ be a prime, and let $m$ be a positive integer. The Murty graph [16] is a $(p^{m}+2)$-regular graph with $s=2p^{2m}$ nodes, $n=p^{2m}(p^{m}+2)$ edges, and girth five. In the resulting system, a database of $n$ files is stored on $O(n^{2/3})$ servers, $O(n^{1/3})$ files in each, and perfect privacy is ensured against any four colluding servers. To retrieve a file, a user downloads $O(n^{2/3})\cdot f$ field elements.

5. Ramanujan graphs (e.g., [15]) with $n$ edges and constant degree have girth $O(\log n)$. Hence, the system is resilient against any $O(\log n)$ colluding servers, but requires downloading $\delta nf$ field elements for some $\delta\in(0,1)$.
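The numbers quoted for the Petersen graph can be checked mechanically. The sketch below builds the graph with a standard vertex labelling (outer cycle $0,\ldots,4$, inner pentagram $5,\ldots,9$), computes its girth by breadth-first search, and verifies that no four servers induce a cycle. All function names are illustrative, and the upload and rate figures follow the counting used in the example above.

```python
from collections import deque
from itertools import combinations

def petersen_edges():
    outer = [(i, (i + 1) % 5) for i in range(5)]
    spokes = [(i, i + 5) for i in range(5)]
    inner = [(i + 5, (i + 2) % 5 + 5) for i in range(5)]
    return outer + spokes + inner

def girth(num_vertices, edges):
    """Length of a shortest cycle of a simple graph (inf if acyclic)."""
    adj = {v: [] for v in range(num_vertices)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    best = float("inf")
    for root in adj:
        dist, parent = {root: 0}, {root: None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    parent[w] = u
                    queue.append(w)
                elif parent[u] != w:  # a non-tree edge closes a cycle
                    best = min(best, dist[u] + dist[w] + 1)
    return best

def induces_cycle(servers, edges):
    """True if the subgraph induced by `servers` contains a cycle (union-find)."""
    parent = {v: v for v in servers}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        if a in parent and b in parent:
            ra, rb = find(a), find(b)
            if ra == rb:
                return True
            parent[ra] = rb
    return False

EDGES = petersen_edges()
s, n = 10, len(EDGES)
four_private = not any(induces_cycle(S, EDGES) for S in combinations(range(s), 4))
upload = 2 * n   # one coefficient per (server, stored file) pair
rate = 1 / s     # f useful symbols out of s * f downloaded
```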

Remark 7.

It is evident that the correctness of the scheme and its privacy guarantees hold also in cases where there exist two servers that store more than one file in common. However, in the resulting multigraph, these two servers form a cycle, and hence can collude to infer some information regarding the identity of $\textbf{x}_{\phi}$ . On the one hand, the system designer may choose to disperse the files while ignoring the aforementioned restriction in order to increase the number of files in the system, at the price of diminishing its privacy guarantees. On the other hand, if the system is designed such that every two servers store at most one file in common, it is clear that $n\leq{s\choose 2}$ .

III-B Bound

In this subsection we explore the limitations of PIR protocols for graph-based replication systems by proving a bound on the PIR rate. The resulting bound is particularly powerful for the important family of regular graphs, for which the bound is within a factor of two from the rate in Subsection III-A. We prove the bound for two-replication systems that provide nontrivial privacy guarantees, namely, the system is at least two-private. In addition, the maximum degree of a vertex in $G$ is denoted by $\delta$ .

Lemma 8.

In every two-private two-replication system the PIR rate is at most $\frac{\delta}{n}$ .

Proof.

Let $G$ be the induced graph, and let $\mu_{i}\geq 0$ be the fraction of $f$ which is downloaded from server $i$ by the user. Clearly, it must be that $\mu_{i}+\mu_{j}\geq 1$ for every edge $\{i,j\}\in E(G)$ , since otherwise, servers $i$ and $j$ can infer that their mutual file is not required by the user, and hence the system is not two-private. Further, the PIR rate of the system is $(\mathbbold{1}_{s}\cdot\boldsymbol{\mu}^{\top})^{-1}$ , where $\mathbbold{1}_{s}$ is the all $1$ ’s vector of length $s$ and $\boldsymbol{\mu}\triangleq(\mu_{1},\ldots,\mu_{s})$ . Hence, an upper bound on the PIR rate of the system is obtained from the optimal solution of the following linear program.

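$\displaystyle\min\ \mathbbold{1}_{s}\cdot\boldsymbol{\mu}^{\top}\quad\text{subject to}\quad\mu_{i}+\mu_{j}\geq 1\ \text{ for every }\{i,j\}\in E(G),\quad\boldsymbol{\mu}\geq 0.$ (2)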

That is, the inverse of the optimum value of the objective function serves as an upper bound on the PIR rate of the system. In the following problem, which is called the dual of (2), $\boldsymbol{\eta}$ is a vector of $n$ variables.

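$\displaystyle\max\ \mathbbold{1}_{n}\cdot\boldsymbol{\eta}^{\top}\quad\text{subject to}\quad\sum_{e\in E(G):\,i\in e}\eta_{e}\leq 1\ \text{ for every }i\in[s],\quad\boldsymbol{\eta}\geq 0.$ (3)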

According to the primal-dual theory [7, Sec. 29.4], any solution which is feasible for (3) provides a lower bound for (2). It is readily verified that $\boldsymbol{\eta}=\frac{1}{\delta}\cdot\mathbbold{1}_{n}$ is a feasible solution for (3), and the objective function for this solution equals $n/\delta$ . Therefore, the PIR rate is bounded by $\delta/n$ . ∎

In cases where $G$ is a regular graph, which are particularly interesting since they induce systems with balanced storage, the resulting bound equals $\frac{\delta}{n}=\frac{2\delta}{s\delta}=2/s$ . However, the possibility of a considerable rate improvement in highly-unbalanced systems remains widely open.
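The duality argument can be illustrated numerically. The sketch below reads (2) as minimizing $\sum_{i}\mu_{i}$ subject to $\mu_{i}+\mu_{j}\geq 1$ on every edge, and (3) as maximizing $\sum_{e}\eta_{e}$ subject to the $\eta$-values around every vertex summing to at most $1$; this reading is an assumption that is consistent with the proof of Lemma 8. For a regular graph, $\boldsymbol{\mu}=\frac{1}{2}\cdot\mathbbold{1}_{s}$ and $\boldsymbol{\eta}=\frac{1}{\delta}\cdot\mathbbold{1}_{n}$ are both feasible and attain the same value, so both are optimal. The complete graph on four vertices is used only as a small example.

```python
from fractions import Fraction
from itertools import combinations

def primal_value(mu, edges):
    """Objective of (2) at mu, or None if mu is infeasible."""
    if any(m < 0 for m in mu) or any(mu[i] + mu[j] < 1 for i, j in edges):
        return None
    return sum(mu)

def dual_value(eta, edges, num_vertices):
    """Objective of (3) at eta, or None if eta is infeasible."""
    load = [Fraction(0)] * num_vertices
    for e, (i, j) in zip(eta, edges):
        load[i] += e
        load[j] += e
    if any(e < 0 for e in eta) or any(l > 1 for l in load):
        return None
    return sum(eta)

# Example: the complete graph K_4, which is 3-regular (s = 4, n = 6, delta = 3).
s, delta = 4, 3
edges = list(combinations(range(s), 2))
n = len(edges)

primal = primal_value([Fraction(1, 2)] * s, edges)        # s / 2
dual = dual_value([Fraction(1, delta)] * n, edges, s)     # n / delta
rate_bound = 1 / dual                                      # delta / n = 2 / s
```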

IV Arbitrary replication factors

In this section we consider $r$-replication systems for $r\geq 2$, which are favored in practice due to their greater resilience to simultaneous failures [12, 5]. First, for any integer $r\geq 2$, collusion resistance of $r-1$ can be attained by a simple scheme that is given in Subsection IV-A. Then, we provide another scheme in Subsection IV-B, which guarantees larger collusion resistance by a reduction to the $2$-replication case. The collusion resistance in the latter case will strongly depend on our ability to increase the girth by removing edges from a certain multigraph. To simplify the discussion, in this section we relax the requirement that every two servers share at most one file in common.

IV-A Replication factor $r$ and collusion resistance $r-1$

The user begins by choosing a uniformly random matrix $V\in\mathbb{F}_{q}^{r\times n}$, whose rows sum to $\textbf{e}_{\phi}$, the $\phi$’th unit vector of length $n$. Then, the user disperses the $nr$ symbols of the matrix $V$ to the queries $\{\textbf{q}_{i}\}_{i=1}^{s}$ arbitrarily (this is possible since $\sum_{i=1}^{s}|\textbf{q}_{i}|=\sum_{i=1}^{s}|\Gamma(i)|=rn$, where $|\textbf{q}_{i}|$ is the length of $\textbf{q}_{i}$), such that every server that stores a file $\textbf{x}_{j}$ receives a unique entry from the $j$’th column of $V$. In turn, the servers respond with the respective linear combinations $\{\textbf{a}_{j}=\textbf{q}_{j}\cdot X\}_{j=1}^{s}$, and the user computes $\sum_{i=1}^{s}\textbf{a}_{i}=\textbf{e}_{\phi}\cdot X=\textbf{x}_{\phi}$.

It is readily verified that every set of $r-1$ servers can observe at most $r-1$ entries in every column of $V$, which appear entirely random, and hence the resulting scheme is $(r-1)$-private. Notice that there is no restriction on the number of files that can be stored in this system, nor is there a restriction on their dispersion.
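A minimal simulation of this scheme is sketched below, assuming a prime field $\mathbb{F}_{P}$ and a toy dispersion in which each of four files is stored on three of four servers. The field size, the data layout, and all function names are illustrative assumptions rather than part of the scheme's description.

```python
import random

P = 101  # a prime, so the integers mod P form a field (illustrative choice)

def build_queries(storage, n, phi, rng):
    """storage maps server -> list of stored file indices; each file is on r servers."""
    queries = {server: {} for server in storage}
    for j in range(n):
        holders = [srv for srv in storage if j in storage[srv]]
        # r - 1 uniform entries, the last one fixed so the column sums to e_phi[j]
        column = [rng.randrange(P) for _ in holders[:-1]]
        column.append(((1 if j == phi else 0) - sum(column)) % P)
        for srv, coeff in zip(holders, column):
            queries[srv][j] = coeff
    return queries

def answer(query, files):
    """A server's response: the queried linear combination of the files it stores."""
    length = len(next(iter(files.values())))
    return [sum(c * files[j][k] for j, c in query.items()) % P for k in range(length)]

def retrieve(storage, files, phi, rng):
    """Run the protocol end to end and return the user's reconstruction of file phi."""
    queries = build_queries(storage, len(files), phi, rng)
    answers = [answer(queries[srv], files) for srv in storage]
    return [sum(col) % P for col in zip(*answers)]

# Toy system with r = 3: file j is stored on every server except server j.
rng = random.Random(1)
storage = {i: [j for j in range(4) if j != i] for i in range(4)}
files = {j: [rng.randrange(P) for _ in range(6)] for j in range(4)}
```

Any two servers in this toy system see at most two of the three entries of each column of $V$, which are uniformly distributed regardless of $\phi$.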

IV-B Arbitrary replication factor by reduction

In systems where files might be stored in more than two servers, one can obtain perfect privacy by “ignoring” all but two copies of every file that is replicated more than twice, in a sense that will be made clear shortly, and applying the scheme in Section III. Observe that choosing which copies to ignore may drastically affect the collusion resistance of the system, since each choice produces a different graph with different cycles. Nevertheless, this observation can in fact contribute to the security of the system by concealing the cycle structure of the resulting graph from an adversary. In what follows we formalize these intuitions and discuss the different aspects of the reduction to the 2-replication scheme.

Evidently, it is natural to consider an $r$-replication system for $r\geq 2$ (or in fact, any replication system) as a hypergraph, where each file corresponds to a hyperedge. Yet, for our purpose it is often more convenient to consider it as a colored multigraph. That is, instead of considering every file as a hyperedge, which is incident with the nodes that contain it, we consider a multigraph in which every edge carries a label (or a color) in $[n]$. Then, two servers are connected by an edge with label $i\in[n]$ if both of them contain a copy of $\textbf{x}_{i}$. Clearly, given a hypergraph $G$, one can easily create the respective colored multigraph $\hat{G}$ by replacing hyperedge $i$ with a clique whose edges are labelled by $i$. Notice that $\hat{G}$ can be a multigraph (i.e., contain parallel edges) since hyperedges can intersect in more than one node. An illustration of these definitions is given in Figure 1, which also demonstrates the natural notions of monochromatic and polychromatic cycles, which will be useful in the sequel. In what follows we use $G$ and $\hat{G}$ interchangeably.
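The passage from a hypergraph to its colored multigraph is mechanical, as the following sketch suggests. Hyperedges are given as sets of server indices and labels are list positions; both conventions, as well as the function names, are assumptions made for illustration. The second function reports server pairs joined by edges of different labels, which are the polychromatic cycles of length two.

```python
from collections import defaultdict
from itertools import combinations

def colored_multigraph(hyperedges):
    """Replace hyperedge i by a clique whose edges all carry the label i."""
    return [(frozenset(pair), i)
            for i, servers in enumerate(hyperedges)
            for pair in combinations(sorted(servers), 2)]

def polychromatic_two_cycles(multigraph):
    """Server pairs that are joined by edges of more than one label."""
    labels = defaultdict(set)
    for pair, label in multigraph:
        labels[pair].add(label)
    return {pair: found for pair, found in labels.items() if len(found) > 1}

# Hypothetical 3-replication system with three files on six servers.
hyperedges = [{0, 1, 2}, {1, 2, 3}, {3, 4, 5}]
G_hat = colored_multigraph(hyperedges)
parallel = polychromatic_two_cycles(G_hat)
```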

Given a replication system with a respective multigraph $\hat{G}$ , it is obvious that the user can choose any two copies of every file, and apply the scheme from Section III while ignoring the remaining copies. Formally, for a server $i$ that stores a copy of $\textbf{x}_{j}$ that is chosen to be ignored by the user, the user simply transmits a zero coefficient for $\textbf{x}_{j}$ , or omits that coefficient altogether. Further, the operation of ignoring all but two copies of every file corresponds to removing all but one of the edges of every color. Obviously, there are potentially many options to choose which edge to keep for every label, and every such choice can be described by a function $c:[n]\to E(\hat{G})$ such that the edge $c(i)$ is labelled by $i$ , for every $i\in[n]$ . For any such $c$ , let $\hat{G}_{c}$ be the result of keeping the edges $\{c(i)\}_{i\in[n]}$ , and removing the remaining ones. It is readily verified that the resulting scheme guarantees perfect privacy against colluding sets that do not contain a cycle in $\hat{G}_{c}$ .

Clearly, if one can choose the file dispersion in the system as one pleases, then it is possible to first choose the dispersion of only two copies of each file, so that the resulting graph $G^{\prime}$ has a certain girth. Then, the remaining copies can be dispersed arbitrarily, and the PIR scheme is performed with respect to a function $c$ such that $c(i)\in E(G^{\prime})$ for every $i$. However, if $\hat{G}$ is given to the user, finding a function $c$ such that $\hat{G}_{c}$ has a large girth requires more care.

For a given $\hat{G}$ one can choose $c$ at random. In spite of not having any clear minimum girth guarantee, this approach has the extra benefit of concealing the cycle structure from an adversary. For a given integer $g$, a function $c$ such that $\hat{G}_{c}$ has girth $g$, if one exists, can be found by deciding the feasibility of the following $\{0,1\}$-program. In this program, for $i\in[n]$ let $E_{i}$ be the set of all $2$-subsets $\{a,b\}$ of $[s]$ such that there exists an edge $\{a,b\}$ labelled by $i$.

• Objective: None.

• Variables: $\{x_{i,\{a,b\}}~{}|~{}i\in[n]\mbox{ and }\{a,b\}\in E_{i}\}$.

• Constraints:

  – $\sum_{\{a,b\}\in E_{i}}x_{i,\{a,b\}}=1$ for all $i\in[n]$.

  – $\sum_{i|\{a,b\}\in E_{i}}x_{i,\{a,b\}}\leq 1$ for every $\{a,b\}$ such that there exists at least one edge $\{a,b\}$ in $\hat{G}$.

  – $\sum_{i|\{a,b\}\in E_{i}}x_{i,\{a,b\}}+\sum_{i|\{b,c\}\in E_{i}}x_{i,\{b,c\}}+\sum_{i|\{c,a\}\in E_{i}}x_{i,\{c,a\}}\leq 2$, for every $a,b,c\in[s]$ that contain at least one triangle in $\hat{G}$.

  $\vdots$

  – $\sum_{j=0}^{g-1}\sum_{i|\{a_{j},a_{(j+1)\bmod g}\}\in E_{i}}x_{i,\{a_{j},a_{(j+1)\bmod g}\}}\leq g-1$ for every $a_{0},\ldots,a_{g-1}\in[s]$ that contain at least one $g$-cycle in $\hat{G}$.

Clearly, the first set of constraints guarantees that exactly one edge is chosen for every file $i\in[n]$ . The second set of constraints guarantees that the resulting choice does not contain $2$ -cycles, the next set guarantees that there are no triangles, and so on. Finally, we note that while solving this system for a general $g$ is NP-hard, the special case $g=2$ reduces to finding a maximum matching in a bipartite graph, a problem that can be solved efficiently.
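For the case $g=2$, only the first two sets of constraints remain: every file keeps exactly one server pair, and no server pair is kept by two files. One way to solve it is sketched below with a standard augmenting-path bipartite matching between files and server pairs. The input format (a list of hyperedges) and the function name are assumptions of this sketch.

```python
from itertools import combinations

def choose_distinct_pairs(hyperedges):
    """Pick one server pair per file so that no pair is used twice, or return None."""
    candidates = [list(combinations(sorted(h), 2)) for h in hyperedges]
    owner = {}  # server pair -> file currently assigned to it

    def try_assign(i, visited):
        for pair in candidates[i]:
            if pair in visited:
                continue
            visited.add(pair)
            # take a free pair, or evict the current owner if it can move elsewhere
            if pair not in owner or try_assign(owner[pair], visited):
                owner[pair] = i
                return True
        return False

    for i in range(len(hyperedges)):
        if not try_assign(i, set()):
            return None
    return {i: pair for pair, i in owner.items()}

# File 0 has two copies, files 1 and 2 have three copies on the same servers.
feasible = choose_distinct_pairs([{0, 1}, {0, 1, 2}, {0, 1, 2}])
# Four files compete for the three pairs of servers {0, 1, 2}: no valid choice.
infeasible = choose_distinct_pairs([{0, 1}, {0, 1, 2}, {0, 1, 2}, {0, 1, 2}])
```

A returned dictionary plays the role of the function $c$, restricted to the requirement that $\hat{G}_{c}$ has no parallel edges; larger girth requirements are not handled by this sketch.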

V Graph-based coding – Reducing the storage overhead at improved PIR rates

This section discusses storage systems in which every file is similarly stored on a small number of servers, but replication is generalized to arbitrary encoding. Hence, when employing an $[N,K]_{q}$ code with rate larger than $1/2$ (i.e., $K/N>1/2$), we obtain an improvement over previous schemes in terms of storage overhead. Furthermore, it is shown that the resulting PIR rate is improved whenever $N-K>1$. However, the (coded) file dispersion must follow a certain structure, and the resulting collusion patterns are in correspondence with polychromatic cycles (see Subsection IV-B and Figure 1), as will be explained next. Finally, we note that the scheme in this section is loosely inspired by ideas from [11] and [14].

Essentially, in the scheme of Section III, every file $\textbf{x}_{i}$ is coded by using a repetition code of length $2$ over the alphabet $\mathbb{F}_{q}^{f}$ . Then, every symbol of the resulting codeword is stored on a different server. The scheme which is presented in this section generalizes this concept by employing codes other than the repetition code.

For integers $N$ and $K$ let $G\in\mathbb{F}_{q}^{K\times N}$ be a generator matrix of an $[N,K]_{q}$ MDS code $\mathcal{D}$. Consider every file $\textbf{x}_{i}$ as an $(f/K)\times K$ matrix $(\textbf{x}_{i,1}^{\top},\ldots,\textbf{x}_{i,K}^{\top})$ over $\mathbb{F}_{q}$, and let $(\textbf{x}_{i,1}^{\top},\ldots,\textbf{x}_{i,K}^{\top})\cdot G\triangleq(\textbf{y}_{i,1}^{\top},\ldots,\textbf{y}_{i,N}^{\top})$, where the vectors $\{\textbf{y}_{i,j}\}_{j=1}^{N}$ are called the codeword symbols of $\textbf{x}_{i}$. Let $\mathcal{L}_{1},\ldots,\mathcal{L}_{N}\subseteq[s]$ be disjoint nonempty subsets whose union is $[s]$ (and hence we must have $N\leq s$). Then, for every $i\in[n]$, disperse the $N$ codeword symbols $\textbf{y}_{i,1},\ldots,\textbf{y}_{i,N}$ to the servers such that for every $j\in[N]$, the codeword symbol $\textbf{y}_{i,j}$ is stored on exactly one server that belongs to $\mathcal{L}_{j}$. For example, one can think of a system in which the servers are partitioned into three disjoint subsets; the servers in the first subset contain the first halves of all files, the servers in the second contain the other halves, and the servers in the third contain the sums of the two halves (see Example 11 and Example 12 which follow).
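The three-subset example corresponds to the $[3,2]$ parity code. The following sketch encodes a file into its two halves and their sum, and recovers it from any two of the three codeword symbols. A prime field and files of even length are assumed for simplicity, and the function names are illustrative.

```python
P = 257  # illustrative prime field size

def encode(file_symbols):
    """Split a file into two halves and append their symbol-wise sum."""
    half = len(file_symbols) // 2
    y1, y2 = file_symbols[:half], file_symbols[half:]
    y3 = [(a + b) % P for a, b in zip(y1, y2)]
    return [y1, y2, y3]

def decode(available):
    """Recover the file from any two codeword symbols, given as {position: symbol}."""
    if 0 in available and 1 in available:
        y1, y2 = available[0], available[1]
    elif 0 in available:
        y1 = available[0]
        y2 = [(c - a) % P for a, c in zip(y1, available[2])]
    else:
        y2 = available[1]
        y1 = [(c - b) % P for b, c in zip(y2, available[2])]
    return y1 + y2

x = [5, 200, 13, 256, 0, 42]   # a file with f = 6 symbols
codeword = encode(x)           # three symbols of length f / 2 each
```

Each file occupies $3\cdot f/2$ field elements in total, which is the storage overhead of $1.5$ that appears in the examples later in this section.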

The above coding scheme gives rise to an $N$-uniform $N$-partite hypergraph in the following manner. Let $[s]$ be the set of vertices, and define hyperedges $e_{1},\ldots,e_{n}$, such that $e_{i}$ contains all servers that store any of $\textbf{y}_{i,1},\ldots,\textbf{y}_{i,N}$. It is evident that the edges are of size $N$, and that the $N$ parts of the hypergraph are the sets $\mathcal{L}_{1},\ldots,\mathcal{L}_{N}$. Let $G$ be this hypergraph, and let $\hat{G}$ be its respective colored multigraph, as described in Subsection IV-B.

We begin by presenting the PIR protocol for the special case $N-K=K$ , and later extend it to other parameters by operating in rounds. Begin by choosing $\boldsymbol{\alpha}\in(\mathbb{F}_{q}^{*})^{n},\boldsymbol{\gamma}\in(\mathbb{F}_{q}^{*})^{s}$ , and $h\in\mathbb{F}_{q}\setminus\{0,1\}$ uniformly at random, and pick an arbitrary subset $\mathcal{K}\subseteq[N]$ of size $K$ . Then, for every $m\in[N]$ , a server $j\in[s]$ which belongs to $\mathcal{L}_{m}$ receives the following query.

[TABLE]

where $\delta(t,m)$ is a Boolean indicator for the event “$m\in\mathcal{K}$ and $t=\phi$”. Namely, the user transmits to server $j$ the part of the vector $\gamma_{j}\cdot\boldsymbol{\alpha}$ that is relevant to it, except that for the $K$ servers that lie in the parts indexed by $\mathcal{K}$ and store a codeword symbol of $\textbf{x}_{\phi}$, the $\phi$’th entry of $\gamma_{j}\cdot\boldsymbol{\alpha}$ is multiplied by $h$. In turn, a server $j$ in $\mathcal{L}_{m}$, which stores $\{\textbf{y}_{\ell,m}|\ell\in\mathcal{L}\}$ for some $\mathcal{L}\subseteq[n]$, responds with $\textbf{a}_{j}\triangleq\sum_{\ell\in\mathcal{L}}(\textbf{q}_{j})_{\ell}\cdot\textbf{y}_{\ell,m}$. Having the responses $\{\textbf{a}_{i}\}_{i=1}^{s}$, the user composes the following matrix.

[TABLE]

where for $m\in[N]$, the $m$’th column of $\textbf{e}$ is

[TABLE]

Now, it is evident that every row in the matrix $Y$ is a codeword in $\mathcal{D}$, whose minimum distance is $N-K+1$. Therefore, since $\textbf{e}$ has at most $K$ nonzero columns, and since $K=N-K$, a decoding algorithm for $\mathcal{D}$ can extract $\textbf{e}$ from the matrix that was composed by the user (notice that the “error values” are in prescribed positions, and hence an erasure correction algorithm suffices). At this point the user has obtained $\{\textbf{y}_{\phi,m}\}_{m\in\mathcal{K}}$, which are sufficiently many codeword symbols of $\textbf{x}_{\phi}$ in order to retrieve it. Therefore, the PIR rate of this scheme is $\frac{f}{s\cdot(f/K)}=\frac{K}{s}=\frac{N-K}{s}$. The proof of privacy will be given after the general description.

Notice that in the above scheme, $N-K$ codeword symbols of $\textbf{x}_{\phi}$ are obtained, while $K$ many of those are sufficient to retrieve $\textbf{x}_{\phi}$ . However, in cases where $N-K<K$ , the scheme will not be successful, and in cases where $N-K>K$ , the resulting scheme will not be exploited to its full potential.

Therefore, to address cases in which $K\neq N-K$ , we retrieve multiple files in rounds, a standard practice in the PIR literature (e.g., [11, 14]). That is, we assume that the user wishes to download $\textbf{x}_{\phi_{1}},\ldots,\textbf{x}_{\phi_{b}}$ privately for some $b\geq 1$ , and the protocol operates in $r\geq 1$ rounds. In each round, the user sends a query to every server, and receives responses from all servers. Specifically, we choose $b$ and $r$ so that $Kb=r(N-K)$ , i.e., $r\triangleq\frac{LCM(K,N-K)}{N-K}$ and $b\triangleq\frac{LCM(K,N-K)}{K}$ . Prior to executing these rounds, the user fixes the following subsets of $[N]$

[TABLE]

such that in every row, the sets in the union are pairwise disjoint, such that $|J^{(i)}|=N-K$ for every $i\in[r]$, and such that $|\cup_{i=1}^{r}J^{(i,j)}|=K$ for every $j\in[b]$. Intuitively, for $i\in[r]$ and $j\in[b]$, the set $J^{(i,j)}$ contains the indices of the codeword symbols of $\textbf{x}_{\phi_{j}}$ that are retrieved during round $i$. The choice of such sets is easy, and is illustrated in Appendix B.
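One simple way to produce sets with these properties is sketched below; it is an illustrative construction and not necessarily the one given in Appendix B. The $r(N-K)=bK$ retrieval slots are numbered consecutively, slot $t$ is assigned to round $\lfloor t/(N-K)\rfloor$ and to file $\lfloor t/K\rfloor$, and it fetches the symbol index $t\bmod N$ (indices are $0$-based here). Since any $N-K$ or $K$ consecutive integers are distinct modulo $N$, the sets within a round are pairwise disjoint and each file collects $K$ distinct symbols.

```python
from math import gcd

def round_schedule(N, K):
    """Return (r, b, J), where J[i][j] holds the symbol indices of file j fetched in round i."""
    lcm = K * (N - K) // gcd(K, N - K)
    r, b = lcm // (N - K), lcm // K
    J = [[set() for _ in range(b)] for _ in range(r)]
    for t in range(lcm):                       # r * (N - K) = b * K slots in total
        J[t // (N - K)][t // K].add(t % N)     # consecutive slots differ modulo N
    return r, b, J

# Example: N = 5, K = 3 gives r = 3 rounds for b = 2 files.
r, b, J = round_schedule(5, 3)
```

For $N=5$ and $K=3$ this yields the rounds $\{0,1\}$, $\{2\}\cup\{3\}$, and $\{4,0\}$, so the first file collects the symbols $\{0,1,2\}$ and the second collects $\{3,4,0\}$.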

In each round $i$ the user executes the aforementioned protocol (for the case $K=N-K$ ), where $J^{(i)}$ is used in lieu of the set $\mathcal{K}$ . That is, the queries are defined as in (4), with the difference that $\delta(t,m)$ is a Boolean indicator for the event “there exists $j\in[b]$ such that $t=\phi_{j}$ and $m\in J^{(i,j)}$ ”. Having obtained the responses from all servers in round $i$ , the user computes

[TABLE]

where for $m\in[N]$ , the $m$ ’th column of $\textbf{e}^{\prime}$ is

[TABLE]

Since $|J^{(i)}|=N-K$, a decoding algorithm on the matrix $Y$ can extract the values of $\textbf{e}^{\prime}$. Hence, according to the structure of the sets fixed above, it follows that by the end of the $r$’th round, the user has obtained the $K$ codeword symbols $\{\textbf{y}_{\phi_{j},m}\}_{m\in\cup_{i}J^{(i,j)}}$ of $\textbf{x}_{\phi_{j}}$ for every $j\in[b]$, and hence all the files $\{\textbf{x}_{\phi_{j}}\}_{j=1}^{b}$ can be retrieved. The resulting PIR rate is

$\displaystyle\frac{b\cdot f}{r\cdot s\cdot(f/K)}=\frac{bK}{rs}=\frac{N-K}{s}.$

Remark 9.

Roughly speaking, the scheme which is described in Section III is a special case of the one in this section, where $K=1$, $N=2$, and $\mathcal{D}\triangleq\{(x,-x)|x\in\mathbb{F}_{q}\}$, and the resulting rate is indeed $\frac{N-K}{s}=\frac{1}{s}$. However, further simplification is possible for this particular choice of $\mathcal{D}$, since the process of extracting the error vector $\textbf{e}$ reduces to multiplying by $\mathbbold{1}$ from the left. Hence, the partitioning of the servers into subsets $\{\mathcal{L}_{j}\}_{j=1}^{N}$ is not required.

Proposition 10.

A set $\mathcal{S}\subseteq V$ that contains no polychromatic cycles in $\hat{G}$ gains no information about $\phi_{1},\ldots,\phi_{b}$ .

Proof.

For $\mathcal{S}$ that does not contain a polychromatic cycle, let $\mathcal{R}\subseteq[n]$ be the set of hyperedges in $G$ that have two or more vertices in $\mathcal{S}$. Similar to Proposition 1, we analyze the matrix which is chosen according to the random variable $Q_{\mathcal{S},\mathcal{R}}$. Clearly, every matrix which is chosen according to $Q_{\mathcal{S},\mathcal{R}}$ is $(\mathcal{S},\mathcal{R})$-compatible with $G$, and we show that the converse is also true.

Let $M\in\mathbb{F}_{q}^{|\mathcal{S}|\times|\mathcal{R}|}$ be a matrix which is $(\mathcal{S},\mathcal{R})$-compatible with $G$. Fix some $v_{i}\in\mathcal{S}$ as the starting point of the BFS algorithm, and choose an arbitrary value for $\gamma_{i}$ (with probability $1$). Once $\gamma_{i}$ is fixed, it is evident that $\Pr(\gamma_{i}\cdot\alpha_{j}\cdot h^{\delta}=M_{i,j})=(q-1)^{-1}$ for every hyperedge $e_{j}$ that is incident with $v_{i}$, regardless of the value of the Boolean indicator $\delta$. Notice that the only mutual element of these hyperedges is $v_{i}$, since otherwise, a polychromatic cycle of length two would exist in $\hat{G}$. Therefore, once $\alpha_{j}$ is fixed for such a hyperedge $e_{j}$, we have that $\Pr(\gamma_{\ell}\cdot\alpha_{j}\cdot h^{\delta}=M_{\ell,j})=(q-1)^{-1}$ for every $\ell$ such that $v_{\ell}\in e_{j}\cap\mathcal{S}$, again, regardless of $\delta$. Proceeding in a BFS fashion, we have that each node-hyperedge incidence reduces the overall probability of obtaining $M$ by a multiplicative factor of $(q-1)^{-1}$. Since $\mathcal{S}$ does not contain a polychromatic cycle, no discrepancy is encountered, which concludes the proof. ∎

Example 11.

Consider $s=12$ , and let $\mathcal{D}$ be the parity code $\{(x,y,x+y)|x,y\in\mathbb{F}_{q}\}$ , and hence $N=3$ and $K=2$ . Also, let $\mathcal{L}_{1}=\{1,\ldots,4\}$ , $\mathcal{L}_{2}=\{5,\ldots,8\}$ , and $\mathcal{L}_{3}=\{9,\ldots,12\}$ . Consider the following $16$ hyperedges.

[TABLE]

It is readily verified that every two distinct edges intersect in at most one node, and hence, there are no polychromatic cycles of length $2$ . The resulting system is $2$ -private, has storage overhead $1.5$ , and its PIR rate is $1/12$ .

Example 12.

Generalizing the previous example, let $s$ be any integer divisible by $3$, let $\mathcal{D}$ be the parity code, and let $\mathcal{L}_{1}=\{1,\ldots,s/3\}$, $\mathcal{L}_{2}=\{s/3+1,\ldots,2s/3\}$, and $\mathcal{L}_{3}=\{2s/3+1,\ldots,s\}$. Let $\mathcal{M}_{1},\ldots,\mathcal{M}_{s/3}$ be edge-disjoint maximum matchings in a complete bipartite graph $H$ whose one side is $\mathcal{L}_{2}$, and the other is $\mathcal{L}_{3}$. (Recall that a matching is a subset of disjoint edges. A maximal matching is a matching such that any edge that is added to it violates the disjointness of its edges. A maximum matching is a matching of the largest possible cardinality. It is readily verified that a complete bipartite graph $K_{m,m}$ contains $m$ disjoint maximum matchings.) Notice that $|\mathcal{M}_{i}|=s/3$ for every $i$, and consider the following hyperedges.

[TABLE]

We claim that any two of the above hyperedges intersect in at most one node. Assuming otherwise we have $|\{a_{1},a_{2},a_{3}\}\cap\{b_{1},b_{2},b_{3}\}|=2$ for some integers $a_{i}$ and $b_{i}$ . If $a_{1}=b_{1}$ , it follows that the edges $\{a_{2},a_{3}\}$ and $\{b_{2},b_{3}\}$ in $H$ share a vertex, even though they both belong to $\mathcal{M}_{a_{1}}$ , a contradiction. If $a_{1}\neq b_{1}$ , it follows that the matchings $\mathcal{M}_{a_{1}}$ and $\mathcal{M}_{b_{1}}$ both contain the edge $\{a_{2},a_{3}\}=\{b_{2},b_{3}\}$ , another contradiction.

Therefore, the resulting system is $2$-private, accommodates $n=s^{2}/9$ files, incurs storage overhead of $1.5$, and has PIR rate of $1/s$. For comparison, considering the complete graph on $s$ nodes and applying the scheme in Section III provides a $2$-private system with $n=(s^{2}-s)/2$ files, storage overhead $2$, and comparable PIR rate $1/s$.
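The construction of Example 12 is easy to instantiate. In the sketch below, the $i$'th matching of $H$ is taken to be the cyclic shift by $i$, which is one convenient choice of edge-disjoint maximum matchings and is an assumption of this sketch; the resulting hyperedges need not coincide with those listed in Example 11. Every hyperedge has the form $\{a_{1},a_{2},a_{3}\}$ with $a_{1}\in\mathcal{L}_{1}$ and $\{a_{2},a_{3}\}\in\mathcal{M}_{a_{1}}$.

```python
from itertools import combinations

def example_hyperedges(s):
    """Hyperedges {a, b, c}: a in L1 indexes a matching, and {b, c} is an edge of it."""
    m = s // 3
    L1 = list(range(1, m + 1))
    L2 = list(range(m + 1, 2 * m + 1))
    L3 = list(range(2 * m + 1, 3 * m + 1))
    hyperedges = []
    for i, a in enumerate(L1):
        # i-th perfect matching between L2 and L3: cyclic shift by i
        for u in range(m):
            hyperedges.append(frozenset({a, L2[u], L3[(u + i) % m]}))
    return hyperedges

def max_pairwise_intersection(hyperedges):
    """Largest number of servers shared by two distinct hyperedges."""
    return max(len(e1 & e2) for e1, e2 in combinations(hyperedges, 2))

H12 = example_hyperedges(12)   # s = 12 servers, as in Example 11
```

For $s=12$ this gives $16$ hyperedges, any two of which share at most one server, so no polychromatic cycle of length two exists.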

VI Discussion and open questions

In this paper we initiated a study of private information retrieval for a specific storage model that is widely used in practice, and widely studied in theoretical research. In order to improve our understanding of this model, and in order to improve its applicability to real-world systems, we suggest the following research directions.

1. Close the gap between the achievable PIR rate in Subsection III-A and the upper bound in Subsection III-B.

2. Improve the collusion resilience in systems with arbitrary replication factors.

3. Construct families of dense graphs in which $\mathcal{T}(\mathcal{S},\phi)$ (see (1)) is large for every $\mathcal{S}\subseteq[s]$ and every $\phi$.

4. Study graceful degradation for replication factors larger than two.

5. Find PIR schemes for $2$-replication systems that guarantee collusion resistance against cycles, and are nontrivial (i.e., download less than the entire dataset).

Acknowledgments

The work of Itzhak Tamo was supported in part by Israel Science Foundation (ISF) Grant 1030/15 and NSF-BSF Grant 2015814. The work of Eitan Yaakobi was supported in part by Israel Science Foundation (ISF) grant 1817/18. The work of Netanel Raviv was supported in part by the postdoctoral fellowship of the Center for the Mathematics of Information (CMI) in the California Institute of Technology.

Appendix A Proof of the main theorem

The proof of Theorem 4 requires two auxiliary lemmas (Lemma 13 and Lemma 14), and is then given in two parts (Lemma 15 and Lemma 16).

Lemma 13.

Let $C\subseteq G$ be a cycle with $c$ edges, and let $M\in\mathbb{F}_{q}^{c\times(c-1)}$ be a matrix which is $(V(C),E(C)\setminus\{j\})$ -compatible, where $j$ is the maximum index of an edge in $E(C)$ . Then, there exist precisely $q-1$ vectors $\textbf{a}\in\mathbb{F}_{q}^{c}$ such that $M^{\prime}\triangleq(M|\textbf{a})\in\mathbb{F}_{q}^{c\times c}$ is $(V(C),E(C))$ -compatible and $\operatorname{rank}(M^{\prime})=c-1$ .

Proof.

First, observe that since $C\setminus\{j\}$ is a tree, and since $M$ is $(V(C),E(C)\setminus\{j\})$-compatible with $G$, it follows that $\operatorname{rank}M=c-1$. Hence, the added vector $\textbf{a}$ must be in $\operatorname{colspan}(M)$, i.e.,

$\displaystyle\textbf{a}=\sum_{k=1}^{c-1}m_{k}\textbf{c}_{k},$ (6)

where the $\textbf{c}_{k}$’s are the columns of $M$ and the $m_{k}$’s are coefficients from $\mathbb{F}_{q}$. Furthermore, since $M^{\prime}$ must be compatible with $G$, the column $\textbf{a}$ must contain nonzero entries precisely in row $i_{1}$ and row $i_{2}$, which correspond to the two vertices incident with edge $j$. Hence, since each row $k\in V(C)\setminus\{i_{1},i_{2}\}$ of $M$ contains precisely two nonzero entries in some columns $k_{1}$ and $k_{2}$, it follows that intersecting the column span of $M$ with $N_{k}\triangleq\{\textbf{x}=(x_{i})_{i=1}^{c}\in\mathbb{F}_{q}^{c}|x_{k}=0\}$ reduces the degrees of freedom in (6) by $1$, since it renders any one of $\{m_{k_{1}},m_{k_{2}}\}$ a linear function of the other. Therefore,

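$\displaystyle\dim X=(c-1)-(c-2)=1,\quad\text{where }X\triangleq\operatorname{colspan}(M)\cap\bigcap_{k\in V(C)\setminus\{i_{1},i_{2}\}}N_{k}.$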

Since any nonzero vector in $X$ is a suitable candidate for $\textbf{a}$, the claim follows. ∎
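Lemma 13 can be checked by exhaustive search on small instances. The sketch below works over a prime field $\mathbb{F}_{p}$ (chosen only to keep the arithmetic simple), fixes the cycle on vertices $0,\ldots,c-1$ in which edge $k$ joins $k$ and $(k+1)\bmod c$, draws a random compatible matrix $M$ for the path obtained by removing the last edge, and counts the columns $\textbf{a}$ supported on the two endpoints of that edge for which the extended matrix has rank $c-1$. The vertex labelling and function names are assumptions of this sketch.

```python
import random
from itertools import product

def rank_mod_p(rows, p):
    """Rank of a matrix over the prime field F_p, by Gaussian elimination."""
    m = [list(r) for r in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] % p), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)
        m[rank] = [x * inv % p for x in m[rank]]
        for i in range(len(m)):
            if i != rank and m[i][col] % p:
                factor = m[i][col]
                m[i] = [(x - factor * y) % p for x, y in zip(m[i], m[rank])]
        rank += 1
    return rank

def count_extensions(c, p, rng):
    """Number of compatible columns a that keep the rank of (M | a) at c - 1."""
    M = [[0] * (c - 1) for _ in range(c)]
    for k in range(c - 1):          # edge k joins vertices k and k + 1
        M[k][k] = rng.randrange(1, p)
        M[k + 1][k] = rng.randrange(1, p)
    count = 0
    for x, y in product(range(1, p), repeat=2):   # removed edge joins c - 1 and 0
        a = [0] * c
        a[c - 1], a[0] = x, y
        extended = [row + [entry] for row, entry in zip(M, a)]
        if rank_mod_p(extended, p) == c - 1:
            count += 1
    return count

rng = random.Random(0)
found = count_extensions(4, 7, rng)   # the lemma predicts q - 1 = 6
```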

Lemma 14.

If an edge $e\in E(G)$ is on a cycle in $G$ , then there exists a BFS ordering of $E(G)$ for which $e$ is a back edge.

Proof.

Denote $e_{\phi}\triangleq e=\{v_{f},v_{g}\}$ and choose $v_{d}\in V(G)$ which maximizes $\operatorname{dist}(v_{g},v_{d})$, where the distance between two vertices is defined as the number of edges in the shortest path between them. Without loss of generality, assume that $\operatorname{dist}(v_{g},v_{d})\geq\operatorname{dist}(v_{f},v_{d})$, and consider a BFS run which begins at $v_{d}$. Partition $V(G)$ into layers $L_{1},L_{2},\ldots$ according to their distance from $v_{d}$, and recall that edges inside each layer are always back edges. Hence, if $e_{\phi}$ is inside a layer, we are done. Otherwise, assume that $v_{f}$ is in $L_{i}$ for some $i$, and hence $v_{g}$ is in $L_{i+1}$. Since $e_{\phi}$ is on a cycle, there exists another edge $e^{\prime}$ from a node $v^{\prime}\in L_{i}$ to $v_{g}$. Hence, in cases where $v^{\prime}$ pops out of the queue before $v_{f}$, $e_{\phi}$ will indeed be a back edge. It is readily verified that the order of insertion of discovered vertices in the same layer is arbitrary, and hence there exists a BFS run in which $v^{\prime}$ precedes $v_{f}$, and the claim follows. ∎

We now turn to prove Theorem 4 in two parts.

Lemma 15.

For every subgraph $T\subseteq G$ , the support of the random variable $Q^{T}|\phi$ is the set of all matrices $A\in\mathbb{F}_{q}^{|V(T)|\times|E(T)|}$ such that:

(a)

$A$ is $T$-compatible with $G$; and

(b)

for every cycle $C\subseteq T$

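$\operatorname{rank}(A^{C})=\begin{cases}|E(C)| & \text{if }e_{\phi}\in E(C),\\ |E(C)|-1 & \text{otherwise.}\end{cases}$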

Proof.

For simplicity assume that $2|q$; other cases can be proved similarly. By the definition of $Q^{T}|\phi$, it is evident that (a) is necessary, and according to Proposition 2, it follows that (b) is necessary. In what follows, it is shown that (a) and (b) are also sufficient. To this end, let $A\in\mathbb{F}_{q}^{|V(T)|\times|E(T)|}$ be a matrix which satisfies (a) and (b); we show that there exists a choice of $\boldsymbol{\alpha},\boldsymbol{\gamma},$ and $h$ for which $Q^{T}|\phi$ produces $A$.

Consider a BFS run on $T$ , and number $V(T)$ and $E(T)$ according to their discovery times. That is, let $v_{1},\ldots,v_{|V(T)|}$ be the vertices of $T$ sorted by their discovery times, and let $e_{1},\ldots,e_{|E(T)|}$ be the edges of $T$ sorted by their discovery times. Also, assume that if $e_{\phi}\in E(T)$ , and $e_{\phi}$ closes a cycle, then it is a back edge (see Lemma 14). The values of $\boldsymbol{\alpha},\boldsymbol{\gamma}$ , and $h$ which produce $A$ are determined according to this BFS ordering, as follows.

First, fix an arbitrary value in $\mathbb{F}_{q}^{*}$ for $\gamma_{1}$ . Then, since $v_{1}$ is incident with the edges $e_{1},\ldots,e_{|\Gamma(v_{1})|}$ , we fix the values of $\alpha_{1},\ldots,\alpha_{|\Gamma(v_{1})|}$ as $\alpha_{i}\triangleq A_{v_{1},e_{i}}/\gamma_{1},i\in\{1,\ldots,|\Gamma(v_{1})|\}$ . Then, for $v_{2},\ldots,v_{|\Gamma(v_{1})|+1}$ , that are the end vertices of $e_{1},\ldots,e_{|\Gamma(v_{1})|}$ , respectively, we fix $\gamma_{i}=A_{v_{i},e_{i-1}}/\alpha_{i-1},i\in\{2,\ldots,|\Gamma(v_{1})|+1\}$ . If $e_{\phi}$ is not on a cycle in $T$ , and $e_{\phi}$ happens to be, say, $e_{1}$ , then we can obviously choose $\alpha_{2}\triangleq A_{v_{2},e_{1}}/(\gamma_{1}\cdot h)$ , where $h$ is arbitrary (the case where $e_{\phi}$ lies on a cycle is treated in the sequel). Clearly, this process goes on unhindered as long as a back edge is not discovered.
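The forward pass described so far can be sketched in code for the case where $T$ is a tree, so that no back edge is ever met. The sketch assumes a prime field, a connected tree, and a matrix $A$ stored as a dictionary indexed by (vertex, edge index); it also ignores $e_{\phi}$ and the factor $h$. All of these are simplifying assumptions made for illustration, and the function name is hypothetical.

```python
import random
from collections import deque

def factor_on_tree(vertices, edges, A, p, gamma_root=1):
    """Find gamma, alpha with A[(v, k)] = gamma[v] * alpha[k] mod p on a connected tree."""
    adj = {v: [] for v in vertices}
    for k, (u, w) in enumerate(edges):
        adj[u].append((k, w))
        adj[w].append((k, u))
    root = vertices[0]
    gamma, alpha = {root: gamma_root}, {}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for k, w in adj[u]:
            if k in alpha:          # edge already handled from its other endpoint
                continue
            alpha[k] = A[(u, k)] * pow(gamma[u], -1, p) % p
            gamma[w] = A[(w, k)] * pow(alpha[k], -1, p) % p
            queue.append(w)
    return gamma, alpha

# A random compatible matrix on a small tree over F_7.
p = 7
rng = random.Random(0)
vertices = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (2, 3)]
A = {}
for k, (u, w) in enumerate(edges):
    A[(u, k)] = rng.randrange(1, p)
    A[(w, k)] = rng.randrange(1, p)
gamma, alpha = factor_on_tree(vertices, edges, A, p)
```

Every compatible matrix on a tree is reproduced this way for each of the $q-1$ choices of the root value, which matches the intuition that acyclic sets of servers observe no constraint beyond compatibility.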

Once a back edge $e_{b}=\{v_{c},v_{d}\},b\neq\phi$ is discovered, we have that $\gamma_{c},\gamma_{d}$ were already determined in earlier stages of the algorithm. Hence, we ought to show that there exists $\alpha_{b}$ for which

$$A_{v_{c},e_{b}}=\gamma_{c}\alpha_{b}\quad\text{and}\quad A_{v_{d},e_{b}}=\gamma_{d}\alpha_{b}.\tag{7}$$

To this end, let $C$ be a cycle which is discovered in whole when $e_{b}$ is discovered and let $c$ be its number of edges. Further, let $M\triangleq A^{C\setminus\{e_{b}\}}$ , i.e., the partial matrix of $A$ which corresponds to the subgraph $C\setminus\{e_{b}\}$ . Similarly, let $N\triangleq\operatorname{diag}(\boldsymbol{\gamma}_{V(C)})I^{C\setminus\{e_{b}\}}\operatorname{diag}(\boldsymbol{\alpha}_{E(C)\setminus\{e_{b}\}})$ be the matrix which corresponds to the choice of entries in $\boldsymbol{\gamma}$ and $\boldsymbol{\alpha}$ up until $e_{b}$ is discovered. By the correctness of the algorithm so far, it follows that $M=N$ . Moreover, both $M$ and $N$ are $(V(C),E(C)\setminus\{j\})$ -compatible, and by the definition of $A$ , the submatrix $A^{C}$ is $C$ -compatible, and its rank is $c-1$ . According to Lemma 13 there exist precisely $(q-1)$ columns $\textbf{c}_{1},\ldots,\textbf{c}_{q-1}$ that extend $M$ (and also $N$ ) to a $C$ -compatible matrix of rank $c-1$ , one of which is $A^{C}$ . Further, it is evident that the matrix $\operatorname{diag}(\boldsymbol{\gamma}_{V(C)})I^{C}\operatorname{diag}(\boldsymbol{\alpha}_{E(C)})$ , for any of the $(q-1)$ possible values of $\alpha_{b}\in\mathbb{F}_{q}^{*}$ , results in a $C$ -compatible matrix of rank $c-1$ as well. Therefore, there exists a 1-1 correspondence between the possible values of $\alpha_{b}$ and $\textbf{c}_{1},\ldots,\textbf{c}_{q-1}$ . Since one of $\textbf{c}_{1},\ldots,\textbf{c}_{q-1}$ is the actual $e_{b}$ ’th column of $A^{C}$ , it follows that there exists a unique value of $\alpha_{b}\in\mathbb{F}_{q}^{*}$ which satisfies (7).

If $e_{\phi}$ lies on a cycle $C^{\prime}$ in $T$ , we denote $e_{\phi}\triangleq\{v_{f},v_{g}\}$ . Since $e_{\phi}$ is a back edge, we have that $\gamma_{g}$ and $\gamma_{f}$ were determined in earlier steps of the algorithm. Hence, we must find $\alpha_{\phi}\in\mathbb{F}_{q}^{*}$ and $h\in\mathbb{F}_{q}\setminus\{0,1\}$ for which

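$$A_{v_{g},e_{\phi}}=\gamma_{g}\alpha_{\phi}h,\tag{8}$$

$$A_{v_{f},e_{\phi}}=\gamma_{f}\alpha_{\phi}.\tag{9}$$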

Clearly, the choice $\alpha_{\phi}\triangleq A_{v_{f},e_{\phi}}/\gamma_{f}$ satisfies (9), and consequently, $h\triangleq\frac{A_{v_{g},e_{\phi}}}{\gamma_{g}\alpha_{\phi}}$ satisfies (8). It remains to show that this value of $h$ is neither $0$ nor $1$. First, it is nonzero since it is a ratio of nonzero terms. Second, if $h=1$, then by Proposition 2 the matrix $A^{C^{\prime}}$ is rank-deficient, in contradiction with condition (b). ∎
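The constructive part of the proof can be illustrated by a toy sketch over $\mathbb{F}_{4}$, which satisfies the assumption $2|q$. The sketch is not the paper's algorithm in full: it omits $e_{\phi}$ and $h$ entirely, it assumes that every entry on an incident vertex-edge pair has the form $\gamma_{v}\alpha_{e}$, and the names `gf4_mul`, `gf4_div`, `build_matrix`, and `recover` are hypothetical. Instead of invoking condition (b), the sketch simply checks consistency at each back edge and returns `None` when the check fails; in the proof, condition (b) is what guarantees that this check succeeds.

```python
from collections import deque

# GF(4) = {0, 1, 2, 3} with 2 = x, 3 = x + 1 and x^2 = x + 1.
EXP = [1, 2, 3]                 # successive powers of the generator x
LOG = {1: 0, 2: 1, 3: 2}

def gf4_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 3]

def gf4_div(a, b):
    # b is assumed to be nonzero
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 3]

def build_matrix(edges, gamma, alpha):
    """Entries gamma_v * alpha_e on incident (vertex, edge) pairs."""
    return {(v, e): gf4_mul(gamma[v], alpha[e]) for e in edges for v in e}

def recover(adj, A, root, gamma_root=1):
    """Recover (gamma, alpha) from A along a BFS, or None if inconsistent.

    A maps (vertex, frozenset edge) to a nonzero GF(4) element.
    """
    gamma = {root: gamma_root}      # gamma of the root is fixed arbitrarily
    alpha = {}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            e = frozenset((v, w))
            if e in alpha:
                continue
            alpha[e] = gf4_div(A[(v, e)], gamma[v])
            if w not in gamma:      # tree edge: determines gamma_w
                gamma[w] = gf4_div(A[(w, e)], alpha[e])
                queue.append(w)
            elif gf4_mul(gamma[w], alpha[e]) != A[(w, e)]:
                return None         # back edge: both endpoints must agree
    return gamma, alpha
```

Since scaling $\boldsymbol{\gamma}$ by a nonzero constant and $\boldsymbol{\alpha}$ by its inverse leaves the matrix unchanged, the recovered pair generally differs from the pair used to build the matrix, yet reproduces the same entries.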

Lemma 16.

For every $T\subseteq G$ , the random variable $Q^{T}|\phi$ is uniformly distributed on its support.

Proof.

Let $A$ be a matrix in the support of $Q^{T}|\phi$. By following the proof of Lemma 15, we have that once $\gamma_{1}$ is fixed, and as long as a back edge is not discovered, every edge-node incidence reduces the overall probability of obtaining $A$ by a factor of $(q-1)^{-1}$. In addition, every back edge which is not $e_{\phi}$ reduces the probability of obtaining $A$ by a factor of $(q-1)^{-1}$ due to (7), instead of by $(q-1)^{-2}$ for tree edges (an edge which is not a back edge in a BFS ordering is called a tree edge). Finally, if $e_{\phi}$ lies on a cycle, it reduces the overall probability by a factor of $\frac{1}{q-1}$ due to (9) and by a factor of $\frac{1}{q-2}$ due to (8). Therefore, we have the following, where $u$ denotes the number of edge-node incidences in $T$, and $k$ denotes the number of back edges in a BFS run (which is identical in every run of a BFS algorithm).

• If $e_{\phi}$ is not on a cycle in $T$, then $\Pr((Q^{T}|\phi)=A)=\left(\frac{1}{q-1}\right)^{u-k}$.

• If $e_{\phi}$ is on a cycle in $T$, then $\Pr((Q^{T}|\phi)=A)=\left(\frac{1}{q-1}\right)^{u-k}\cdot\frac{1}{q-2}$. ∎
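The two cases above can be evaluated numerically with a short sketch. It assumes that $T$ is a connected simple graph in a 2-replication system, so that every edge contributes two edge-node incidences ($u=2|E(T)|$) and a BFS produces $|V(T)|-1$ tree edges, hence $k=|E(T)|-|V(T)|+1$ back edges. The function name `support_probability` and its parameters are hypothetical.

```python
from fractions import Fraction

def support_probability(num_vertices, num_edges, q, e_phi_on_cycle=False):
    """Probability of any fixed matrix in the support of Q^T | phi.

    Assumes a connected simple graph T with 2 incidences per edge.
    """
    u = 2 * num_edges                    # edge-node incidences
    k = num_edges - num_vertices + 1     # back edges in any BFS run
    p = Fraction(1, q - 1) ** (u - k)
    if e_phi_on_cycle:
        p *= Fraction(1, q - 2)          # extra factor for the choice of h
    return p
```

For a triangle and $q=4$, this gives $u=6$, $k=1$, and a probability of $(1/3)^{5}=1/243$ when $e_{\phi}$ is not on the cycle, and $1/486$ when it is. The exponent $u-k=|E(T)|+|V(T)|-1$ matches the number of free parameters in $\boldsymbol{\gamma}$ and $\boldsymbol{\alpha}$ after removing the common scaling.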

Appendix B Choice of sets

The process of choosing the sets $\{J^{(j,i)}\}_{(j,i)\in[r]\times[b]}$ in Section V is simple, and is best illustrated by the following examples.

Example 17.

Assume that $N-K=4$ and $K=6$ , which implies that $r=3$ and $b=2$ . Consider the following matrix

[TABLE]

which naturally corresponds to the sets

[TABLE]

As another example, in which $N-K\geq K$ , we may consider the following.

Example 18.

Assume that $N-K=6$ and $K=4$ , which implies that $r=2$ and $b=3$ . Consider the following matrix

[TABLE]

which naturally corresponds to the sets

[TABLE]

