Kelly Cache Networks

Milad Mahdian; Armin Moharrer; Stratis Ioannidis; Edmund Yeh

arXiv:1901.04092·cs.NI·January 30, 2019

TL;DR

This paper investigates cache placement in queueing networks to optimize objectives like delay and congestion, proposing efficient algorithms and extending results to various queue types.

Contribution

It formulates cache placement as a submodular maximization problem and introduces a fast continuous greedy algorithm with near-optimal approximation guarantees.

Findings

01

The optimization problem is NP-hard but approximable.

02

The continuous greedy algorithm attains a ratio arbitrarily close to $1 - 1/e \approx 0.63$ of the optimal.

03

Results extend to M/M/k and M/D/1 queue networks.

Abstract

We study networks of M/M/1 queues in which nodes act as caches that store objects. Exogenous requests for objects are routed towards nodes that store them; as a result, object traffic in the network is determined not only by demand but, crucially, by where objects are cached. We determine how to place objects in caches to attain a certain design objective, such as minimizing network congestion or retrieval delays. We show that for a broad class of objectives, including minimizing both the expected network delay and the sum of network queue lengths, this optimization problem can be cast as an NP-hard submodular maximization problem. We show that the so-called continuous greedy algorithm attains a ratio arbitrarily close to $1 - 1/e \approx 0.63$ using a deterministic estimation via a power series; this drastically reduces execution time over prior art, which resorts to sampling.…

Tables

Table 1. TABLE I: Notation Summary

Kelly Cache Networks
$G (V, E)$	Network graph, with nodes $V$ and edges $E$
$k_{p} (v)$	position of node $v$ in path $p$
$μ_{(u, v)}$	Service rate of edge $(u, v) \in E$
$ℛ$	Set of classes/types of requests
$λ^{r}$	Arrival rate of class $r \in ℛ$
$p^{r}$	Path followed by class $r \in ℛ$
$i^{r}$	Object requested by class $r \in ℛ$
$𝒞$	Item catalog
$𝒮_{i}$	Set of designated servers of $i \in 𝒞$
$c_{v}$	Cache capacity at node $v \in V$
$x_{v i}$	Variable indicating whether $v \in V$ stores $i \in 𝒞$
$𝐱$	Placement vector of $x_{v i}$'s, in $\{0, 1\}^{|V| |𝒞|}$
$𝝀$	Vector of arrival rates $λ^{r}$ , $r \in ℛ$
$λ_{e}^{r}$	Arrival rate of class $r$ responses over edge $e \in E$
$ρ_{e}$	Load on edge $e \in E$
$Ω$	State space
$𝐧$	Global state vector in $Ω$
$π (𝐧)$	Steady-state distribution of $𝐧 \in Ω$
$𝐧_{e}$	State vector of queue at edge $e \in E$
$π_{e} (𝐧_{e})$	Marginal of steady-state distribution of queue $𝐧_{e}$
$n_{e}$	Size of queue at edge $e \in E$
Cache Optimization
$C$	Global Cost function
$C_{e}$	Cost function of edge $e \in E$
$𝒟$	Set of placements $𝐱$ satisfying capacity constraints
$𝐱_{0}$	A feasible placement in $𝒟$
$F (𝐱)$	Caching gain of placement $𝐱$ over $𝐱_{0}$
$y_{v i}$	Probability that $v \in V$ stores $i \in 𝒞$
$𝐲$	Vector of marginal probabilities $y_{v i}$, in $[0, 1]^{|V| |𝒞|}$
$G (𝐲)$	Multilinear extension under marginals $𝐲$
$𝒟_{𝝀}$	Set of placements under which system is stable under arrivals $𝝀$
$\tilde{𝒟}$	Convex hull of constraints of MaxCG
Conventions
$𝚜𝚞𝚙𝚙 (\cdot)$	Support of a vector
$𝚌𝚘𝚗𝚟 (\cdot)$	Convex hull of a set
${[𝐱]}_{+ i}$	Vector equal to $𝐱$ with $i$ -th coordinate set to 1
${[𝐱]}_{- i}$	Vector equal to $𝐱$ with $i$ -th coordinate set to 0
$𝟎$	Vector of zeros

Table 2. TABLE II: Graph Topologies and Experiment Parameters.

Graph	$|V|$	$|E|$	$|𝒞|$	$|ℛ|$	$|Q|$	$c_{v}$	$F_{PL} (𝐱_{RND})$	$F_{UNI} (𝐱_{RND})$
ER	100	1042	300	1K	4	3	2.75	2.98
ER-20Q	100	1042	300	1K	20	3	3.1	2.88
HC	128	896	300	1K	4	3	2.25	5.23
HC-20Q	128	896	300	1K	20	3	2.52	5.99
star	100	198	300	1K	4	3	6.08	8.3
path	4	3	2	2	1	1	1.2	1.2
dtelekom	68	546	300	1K	4	3	2.57	3.66
abilene	11	28	4	2	2	1/2	4.39	4.39
geant	22	66	10	100	4	2	19.68	17.22

Table 3. TABLE III: Results of $\rho_{u,v}(\mathbf{x})$'s for different caching configurations.

$[x_{11}, x_{21}]$	$ρ_{3, 2}$	$ρ_{2, 1}$
$[0, 0]$	$\frac{λ}{μ_{3, 2}}$	$\frac{λ (1 - p_{3, 2}^{L})}{μ_{2, 1}}$
$[1, 0]$	0	0
$[0, 1]$	0	$\frac{λ}{μ_{2, 1}}$
$[1, 1]$	0	0

Equations

$$\mathcal{D} = \Big\{\mathbf{x} \in \{0,1\}^{|V||\mathcal{C}|} : \sum_{i \in \mathcal{C}} x_{vi} \leq c_{v},\ \forall v \in V\Big\},$$

$$\lambda_{(u,v)}^{r}(\mathbf{x}, \bm{\lambda}) = \lambda^{r} \prod_{k'=1}^{k_{p^{r}}(v)} \big(1 - x_{p^{r}_{k'} i^{r}}\big), \quad \text{for } (v,u) \in p^{r},$$

$$\rho_{(u,v)}(\mathbf{x}, \bm{\lambda}) = \frac{1}{\mu_{(u,v)}} \sum_{r \in \mathcal{R} : (v,u) \in p^{r}} \lambda_{(u,v)}^{r}(\mathbf{x}, \bm{\lambda}).$$

$$\pi(\mathbf{n}) = \prod_{e \in E} \pi_{e}(\mathbf{n}_{e}), \quad \mathbf{n} \in \Omega,$$

$$\Lambda_{\mathbf{x}} := \big\{\bm{\lambda} : \bm{\lambda} \geq \mathbf{0},\ \rho_{e}(\mathbf{x}, \bm{\lambda}) < 1,\ \forall e \in E\big\} \subset \mathbb{R}_{+}^{|\mathcal{R}|},$$

$$\mathcal{D}_{\bm{\lambda}} = \big\{\mathbf{x} \in \mathcal{D} : \rho_{e}(\mathbf{x}, \bm{\lambda}) < 1,\ \forall e \in E\big\} \subseteq \mathcal{D}$$

$$C(\mathbf{x}) = \sum_{e \in E} C_{e}\big(\rho_{e}(\mathbf{x}, \bm{\lambda})\big),$$

$$\mathbf{x} \in \mathcal{D}_{\bm{\lambda}},$$

$$C_{e}(\rho_{e}) \equiv \mathbb{E}[c_{e}(n)] = c_{e}(0) + \sum_{n=0}^{\infty} \big(c_{e}(n+1) - c_{e}(n)\big) \rho_{e}^{n+1}.$$

$$F(\mathbf{x}) = C(\mathbf{x}_{0}) - C(\mathbf{x})$$

$$\mathbf{x} \in \mathcal{D},\ \mathbf{x} \geq \mathbf{x}_{0}$$

$$\tilde{\mathcal{D}} = \mathop{\mathtt{conv}}\big(\{\mathbf{x} : \mathbf{x} \in \mathcal{D},\ \mathbf{x} \geq \mathbf{x}_{0}\}\big) \subseteq [0,1]^{|V||\mathcal{C}|}$$

$$G(\mathbf{y}) = \sum_{\mathbf{x} \in \{0,1\}^{|V||\mathcal{C}|}} F(\mathbf{x}) \times \prod_{(v,i) \in V \times \mathcal{C}} y_{vi}^{x_{vi}} (1 - y_{vi})^{1 - x_{vi}},$$

$$m_{k}$$

$$y_{k+1}$$

$$G(\mathbf{y})$$

$$\mathbf{y} \in \tilde{\mathcal{D}}.$$

$$\frac{\partial G(\mathbf{y})}{\partial y_{vi}}$$

$$\frac{\partial G(\mathbf{y})}{\partial y_{vi}} = \frac{1}{T} \sum_{\ell=1}^{T} \Big(F\big([\mathbf{x}^{\ell}]_{+(v,i)}\big) - F\big([\mathbf{x}^{\ell}]_{-(v,i)}\big)\Big),$$

$$G(\mathbf{y}^{K}) \geq \big(1 - (1-\delta)^{1/\delta}\big) G(\mathbf{y}^{*}) \geq (1 - 1/e)\, G(\mathbf{y}^{*}),$$

$$f(\mathbf{x}) = \sum_{s \in S} \beta_{s} \cdot \prod_{j \in I(s)} (1 - x_{j}),$$

$$f(\mathbf{x}) = \sum_{s \in S} \beta_{s} \prod_{t \in I(s)} (1 - x_{t})$$

$$\mathbb{E}_{\mathbf{y}}[f(\mathbf{x})] = \sum_{s \in S} \beta_{s} \prod_{t \in I(s)} \big(1 - \mathbb{E}_{\mathbf{y}}[x_{t}]\big) \quad \text{(by independence)}$$

$$= \sum_{s \in S} \beta_{s} \prod_{t \in I(s)} (1 - y_{t}). \qquad \blacksquare$$

$$\frac{\partial G(\mathbf{y})}{\partial y_{vi}} \approx \sum_{e \in E} \sum_{k=1}^{L} \alpha_{e}^{(k)} \Big[\rho_{e}^{k}\big([\mathbf{y}]_{-(v,i)}, \bm{\lambda}\big) - \rho_{e}^{k}\big([\mathbf{y}]_{+(v,i)}, \bm{\lambda}\big)\Big]$$

$$G(\mathbf{y}_{K}) \geq \Big(1 - \frac{1}{e}\Big) G(\mathbf{y}^{*}) - 2DB - \frac{P}{2K},$$

$$\pi(\mathbf{n}) = \prod_{e \in E} \pi_{e}(\mathbf{n}_{e}), \quad \mathbf{n} \in \Omega,$$

$$\pi_{e}(\mathbf{n}_{e}) = (1 - \rho_{e}) \prod_{r \in \mathcal{R} : e \in p^{r}} \Big(\frac{\lambda^{r}}{\mu_{e}}\Big)^{n_{e}^{r}}.$$

$$\mathbf{P}[n_{e} = k] = (1 - \rho_{e})\, \rho_{e}^{k}, \quad k \in \mathbb{N}.$$

$$\mathbb{E}[c_{e}(n_{e})] = c_{e}(0) + \sum_{n=0}^{\infty} \big(c_{e}(n+1) - c_{e}(n)\big)\, \mathbf{P}(n_{e} > n)$$

$$= c_{e}(0) + \sum_{n=0}^{\infty} \big(c_{e}(n+1) - c_{e}(n)\big)\, \rho_{e}^{n+1} \quad \text{(by the queue size distribution above)}$$

$$g(x \cap x') \geq g(x) \geq g(x \cup x'),$$

Full text

Kelly Cache Networks

Milad Mahdian, Armin Moharrer, Stratis Ioannidis, and Edmund Yeh

Electrical and Computer Engineering, Northeastern University, Boston, MA, USA

{mmahdian,amoharrer,ioannidis,eyeh}@ece.neu.edu

Abstract

We study networks of M/M/1 queues in which nodes act as caches that store objects. Exogenous requests for objects are routed towards nodes that store them; as a result, object traffic in the network is determined not only by demand but, crucially, by where objects are cached. We determine how to place objects in caches to attain a certain design objective, such as minimizing network congestion or retrieval delays. We show that for a broad class of objectives, including minimizing both the expected network delay and the sum of network queue lengths, this optimization problem can be cast as an NP-hard submodular maximization problem. We show that the so-called continuous greedy algorithm [1] attains a ratio arbitrarily close to $1-1/e\approx 0.63$ using a deterministic estimation via a power series; this drastically reduces execution time over prior art, which resorts to sampling. Finally, we show that our results generalize, beyond M/M/1 queues, to networks of M/M/$k$ and symmetric M/D/1 queues.

I Introduction

Kelly networks [2] are multi-class networks of queues capturing a broad array of queue service disciplines, including FIFO, LIFO, and processor sharing. Both Kelly networks and their generalizations (including networks of quasi-reversible and symmetric queues) are well-studied, classic topics [2, 3, 4, 5]. One of their most appealing properties is that their steady-state distributions have a product form: as a result, steady-state properties such as expected queue sizes, packet delays, and server occupancy rates have closed-form formulas as functions of, e.g., routing and scheduling policies.

In this paper, we consider Kelly networks in which nodes are equipped with caches, i.e., storage devices of finite capacity, which can be used to store objects. Exogenous requests for objects are routed towards nodes that store them; upon reaching a node that stores the requested object, a response packet containing the object is routed towards the request source. As a result, object traffic in the network is determined not only by the demand but, crucially, by where objects are cached. This abstract setting is motivated by, and can be used to model, various networking applications involving the placement and transmission of content. These include information-centric networks [6, 7, 8], content delivery networks [9, 10], web caches [11, 12, 13], wireless/femtocell networks [14, 15, 16], and peer-to-peer networks [17, 18], to name a few.

In many of these applications, determining the object placement, i.e., how to place objects in network caches, is a decision that can be made by the network designer in response to object popularity and demand. To that end, we are interested in determining how to place objects in caches so that traffic attains a design objective such as, e.g., minimizing delay.

We make the following contributions. First, we study the problem of optimizing the placement of objects in caches in Kelly cache networks of M/M/1 queues, with the objective of minimizing a cost function of the system state. We show that, for a broad class of cost functions, including packet delay, system size, and server occupancy rate, this optimization amounts to a submodular maximization problem with matroid constraints. This result applies to general Kelly networks with fixed service rates; in particular, it holds for FIFO, LIFO, and processor sharing disciplines at each queue.

The so-called continuous greedy algorithm [1] attains a $1-1/e$ approximation for this NP-hard problem. However, it does so by computing an expectation over a random variable with exponential support via randomized sampling. The number of samples required to attain the $1-1/e$ approximation guarantee can be prohibitively large in realistic settings. Our second contribution is to show that, for Kelly networks of M/M/1 queues, this randomization can be entirely avoided: a closed-form solution can be computed using the Taylor expansion of our problem’s objective. To the best of our knowledge, we are the first to identify a submodular maximization problem that exhibits this structure, and to exploit it to eschew sampling. Finally, we extend our results to networks of M/M/$k$ and symmetric M/D/1 queues, and prove a negative result: submodularity does not arise in networks of M/M/1/$k$ queues. We extensively evaluate our proposed algorithms over several synthetic and real-life topologies.

The remainder of our paper is organized as follows. We review related work in Sec. II. We present our mathematical model of a Kelly cache network in Sec. III, and our results on submodularity and the continuous-greedy algorithm in networks of M/M/1 queues in Sections IV and V, respectively. Our extensions are described in Sec. VI; our numerical evaluation is in Sec. VII. Finally, we conclude in Sec. VIII.

II Related Work

Our approach is closest to, and inspired by, recent work by Shanmugam et al. [19] and Ioannidis and Yeh [8]. Ioannidis and Yeh consider a setting very similar to ours but without queuing: edges are assigned a fixed weight, and the objective is a linear function of incoming traffic scaled by these weights. This can be seen as a special case of our model, namely, one where edge costs are linear (see also Sec. III-B). Shanmugam et al. [19] study a similar optimization problem, restricted to the context of femtocaching. The authors show that this is an NP-hard, submodular maximization problem with matroid constraints. They provide a $1-1/e$ approximation algorithm based on a technique by Ageev and Sviridenko [20]: this involves maximizing a concave relaxation of the original objective, and rounding via pipage rounding [20]. Ioannidis and Yeh show that the same approximation technique applies to more general cache networks with linear edge costs. They also provide a distributed, adaptive algorithm that attains a $1-1/e$ approximation. The same authors extend this framework to jointly optimize both caching and routing decisions [21].

Our work can be seen as an extension of [8, 19], in that it incorporates queuing in the cache network. In contrast to both [8] and [19], however, costs like delay or queue sizes are highly non-linear in the presence of queuing. From a technical standpoint, this departure from linearity requires us to employ significantly different optimization methods than the ones in [8, 19]. In particular, our objective does not admit a concave relaxation and, consequently, the technique by Ageev and Sviridenko [20] used in [8, 19] does not apply. Instead, we must solve a non-convex optimization problem directly (cf. Eq. (13)) using the so-called continuous-greedy algorithm.

Several papers have studied cache optimization problems under restricted topologies [22, 23, 24, 25, 9]. These works model the network as a bipartite graph: nodes generating requests connect directly to caches in a single hop. The resulting algorithms do not readily generalize to arbitrary topologies. In general, the approximation technique of Ageev and Sviridenko [20] applies to this bipartite setting, and additional approximation algorithms have been devised for several variants [22, 23, 24, 9]. We differ by (a) considering a multi-hop setting, and (b) introducing queuing, which none of the above works considers.

Submodular function maximization subject to matroid constraints appears in many important problems in combinatorial optimization; for a brief review of the topic and applications, see [26] and [27], respectively. Nemhauser et al. [28] show that the greedy algorithm produces a solution within 1/2 of the optimal. Vondrák [29] and Calinescu et al. [1] show that the continuous-greedy algorithm produces a solution within $(1-1/e)$ of the optimal in polynomial time, which cannot be further improved [30]. In the general case, the continuous-greedy algorithm requires sampling to estimate the gradient of the so-called multilinear relaxation of the objective (see Sec. V). One of our main contributions is to show that MaxCG, the optimization problem we study here, exhibits additional structure: we use this to construct a sampling-free estimator of the gradient via a power-series or Taylor expansion. To the best of our knowledge, we are the first to use such an expansion to eschew sampling; this technique may apply to submodular maximization problems beyond MaxCG.

III Model

Motivated by applications such as ICNs [6], CDNs [9, 10], and peer-to-peer networks [17], we introduce Kelly cache networks. In contrast to classic Kelly networks, each node is associated with a cache of finite storage capacity. Exogenous traffic consisting of requests is routed towards nodes that store objects; upon reaching a node that stores the requested object, a response packet containing the object is routed towards the node that generated the request. As a result, content traffic in the network is determined not only by demand but, crucially, by how contents are cached. For completeness, we review classic Kelly networks in Appendix A. An illustration highlighting the differences between Kelly cache networks, introduced below, and classic Kelly networks, can be found in Fig. 1.

Although we describe Kelly cache networks in terms of FIFO M/M/1 queues, the product form distribution (cf. (4)) arises for many different service disciplines beyond FIFO (cf. Section 3.1 of [2]), including Last-In First-Out (LIFO) and processor sharing. All results we present extend to these service disciplines; we discuss more extensions in Sec. VI.

III-A Kelly Cache Networks

Graphs and Paths. We use the notation $G(V,E)$ for a directed graph $G$ with nodes $V$ and edges $E\subseteq V\times V$. A directed graph is called symmetric or bidirectional if $(u,v)\in E$ if and only if $(v,u)\in E$. A path $p$ is a sequence of adjacent nodes, i.e., $p=p_{1},p_{2},\ldots,p_{K}$ where $(p_{k},p_{k+1})\in E$ for all $1\leq k<K\equiv|p|$. A path is simple if it contains no loops (i.e., each node appears once). We use the notation $v\in p$, where $v\in V$, to indicate that node $v$ appears in the path, and $e\in p$, where $e=(u,v)\in E$, to indicate that nodes $u$, $v$ are two consecutive (and, therefore, adjacent) nodes in $p$. For $v\in p$, where $p$ is simple, we denote by $k_{p}(v)\in\{1,\ldots,|p|\}$ the position of node $v\in V$ in $p$, i.e., $k_{p}(v)=k$ if $p_{k}=v$.

Network Definition. Formally, we consider a Kelly network of M/M/1 FIFO queues, represented by a symmetric directed graph $G(V,E)$. As in classic Kelly networks, each edge $e\in E$ is associated with an M/M/1 queue with service rate $\mu_{e}$. (We associate queues with edges for concreteness. Alternatively, queues can be associated with nodes, or both nodes and edges; all such representations lead to product form distributions (4), and all our results extend to these cases.) In addition, each node has a cache that stores objects of equal size from a set $\mathcal{C}$, the object catalog. Each node $v\in V$ may store at most $c_{v}\in\mathbb{N}$ objects from $\mathcal{C}$ in its cache. Hence, if $x_{vi}\in\{0,1\}$ is a binary variable indicating whether node $v\in V$ is storing object $i\in\mathcal{C}$, then $\sum_{i\in\mathcal{C}}x_{vi}\leq c_{v}$ for all $v\in V$. We refer to $\mathbf{x}=[x_{vi}]_{v\in V,i\in\mathcal{C}}\in\{0,1\}^{|V||\mathcal{C}|}$ as the global placement or, simply, placement vector. We denote by

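$$\mathcal{D}=\Big\{\mathbf{x}\in\{0,1\}^{|V||\mathcal{C}|}:\sum_{i\in\mathcal{C}}x_{vi}\leq c_{v},\ \forall v\in V\Big\}, \tag{1}$$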

the set of feasible placements that satisfy the storage capacity constraints. We assume that for every object $i\in\mathcal{C}$ , there exists a set of nodes $\mathcal{S}_{i}\subseteq V$ that permanently store $i$ . We refer to nodes in $\mathcal{S}_{i}$ as designated servers for $i\in\mathcal{C}$ . We assume that designated servers store $i$ in permanent storage outside their cache. Put differently, the aggregate storage capacity of a node is $c_{v}^{\prime}=c_{v}+|\{i:v\in\mathcal{S}_{i}\}|$ , but only the non-designated slots $c_{v}$ are part of the system’s design.

Object Requests and Responses. Traffic in the cache network consists of two types of packets: requests and responses, as shown in Fig. 1(b). Requests for an object are always routed towards one of its designated servers, ensuring that every request is satisfied. However, requests may terminate early: upon reaching any node that caches the requested object, the latter generates a response carrying the object. This is forwarded towards the request’s source, following the same path as the request, in reverse. Consistent with prior literature [8, 21], we treat request traffic as negligible when compared to response traffic, which carries objects, and henceforth focus only on queues bearing response traffic.

Formally, a request and its corresponding response are fully characterized by (a) the object being requested, and (b) the path that the request follows. That is, for the set of requests $\mathcal{R}$ , a request $r\in\mathcal{R}$ is determined by a pair $(i^{r},p^{r})$ , where $i^{r}\in\mathcal{C}$ is the object being requested and $p^{r}$ is the path the request follows. Each request $r$ is associated with a corresponding Poisson arrival process with rate $\lambda^{r}\geq 0$ , independent of other arrivals and service times. We denote the vector of arrival rates by $\bm{\lambda}=[\lambda^{r}]_{r\in\mathcal{R}}\in\mathbb{R}_{+}^{|\mathcal{R}|}.$ For all $r\in\mathcal{R}$ , we assume that the path $p^{r}$ is well-routed [8], that is: (a) path $p^{r}$ is simple, (b) the terminal node of the path is a designated server, i.e., a node in $\mathcal{S}_{i}$ , and (c) no other intermediate node in $p^{r}$ is a designated server. As a result, requests are always served, and response packets (carrying objects) always follow a sub-path of $p^{r}$ in reverse towards the request source (namely, $p^{r}_{1}$ ).

Steady State Distribution. Given an object placement $\mathbf{x}\in\mathcal{D}$ , the resulting system is a multi-class Kelly network, with packet classes determined by the request set $\mathcal{R}$ . This is a Markov process over the state space determined by queue contents. In particular, let $n_{e}^{r}$ be the number of packets of class $r\in\mathcal{R}$ in queue $e\in E$ , and $n_{e}=\sum_{r\in\mathcal{R}}n_{e}^{r}$ be the total queue size. The state of a queue $\mathbf{n}_{e}\in\mathcal{R}^{n_{e}}$ , $e\in E$ , is the vector of length $n_{e}$ representing the class of each packet in each position of the queue. The system state is then given by $\mathbf{n}=[\mathbf{n}_{e}]_{e\in E}$ ; we denote by $\Omega$ the state space of this Markov process.

In contrast to classic Kelly networks, network traffic and, in particular, the load on each queue, depend on placement $\mathbf{x}$ . Indeed, if $(v,u)\in p^{r}$ for $r\in\mathcal{R}$ , the arrival rate of responses of class $r\in\mathcal{R}$ in queue $(u,v)\in E$ is:

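$$\lambda_{(u,v)}^{r}(\mathbf{x},\bm{\lambda})=\lambda^{r}\prod_{k'=1}^{k_{p^{r}}(v)}\big(1-x_{p^{r}_{k'}i^{r}}\big),\quad\text{for }(v,u)\in p^{r}, \tag{2}$$

i.e., responses to requests of class $r$ pass through edge $(u,v)\in E$ if and only if no node preceding $u$ in the path $p^{r}$ stores object $i^{r}$ (see also Fig. 1(b)). As $\mu_{(u,v)}$ is the service rate of the queue in $(u,v)\in E$, the load on edge $(u,v)\in E$ is:

$$\rho_{(u,v)}(\mathbf{x},\bm{\lambda})=\frac{1}{\mu_{(u,v)}}\sum_{r\in\mathcal{R}:(v,u)\in p^{r}}\lambda_{(u,v)}^{r}(\mathbf{x},\bm{\lambda}). \tag{3}$$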
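The per-class response rates and edge loads defined above can be evaluated with a single pass over each request path. The following is a minimal sketch, assuming a hypothetical encoding that is not part of the paper: each request class is a `(rate, item, path)` triple, a placement is a set of `(node, item)` pairs with $x_{vi}=1$, and `mu` maps each response edge to its service rate. Designated servers are assumed to sit at the end of each path and are not listed in the placement.

```python
from collections import defaultdict

def edge_loads(requests, placement, mu):
    """Return {(u, v): rho} for every edge that may carry responses.

    requests:  list of (rate, item, path); path is a list of nodes that
               ends at a designated server (hypothetical encoding).
    placement: set of (node, item) pairs, i.e. the entries with x_{vi} = 1.
    mu:        dict mapping a response edge (u, v) to its service rate.
    """
    rho = defaultdict(float)
    for rate, item, path in requests:
        survive = 1.0  # running product of (1 - x) over p_1, ..., p_k
        for v, u in zip(path, path[1:]):  # (v, u) is a request hop
            survive *= 1.0 - ((v, item) in placement)
            # the response travels on the reverse edge (u, v)
            rho[(u, v)] += rate * survive / mu[(u, v)]
    return dict(rho)

# Toy path 1 -> 2 -> 3 with node 3 the designated server for item "a".
reqs = [(2.0, "a", [1, 2, 3])]
mu = {(2, 1): 4.0, (3, 2): 5.0}
print(edge_loads(reqs, {(2, "a")}, mu))  # {(2, 1): 0.5, (3, 2): 0.0}
```

Caching the item at node 2 removes all response traffic from edge $(3,2)$ while leaving edge $(2,1)$ unchanged, which matches the "no node preceding $u$ stores the object" reading of the response rate.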

The Markov process $\{\mathbf{n}(t)\}_{t\geq 0}$ is positive recurrent when $\rho_{(u,v)}(\mathbf{x},\bm{\lambda})<1$ for all $(u,v)\in E$ [2, 31]. Then, the steady-state distribution has a product form, i.e.:

$$\pi(\mathbf{n})=\prod_{e\in E}\pi_{e}(\mathbf{n}_{e}),\quad\mathbf{n}\in\Omega, \tag{4}$$

where $\textstyle\pi_{e}(\mathbf{n}_{e})=(1-\rho_{e}(\mathbf{x},\bm{\lambda}))\prod_{r\in\mathcal{R}:e\in p^{r}}\left(\frac{\lambda^{r}_{e}(\mathbf{x},\bm{\lambda})}{\mu_{e}}\right)^{n_{e}^{r}},$ and $\lambda_{e}^{r}(\mathbf{x},\bm{\lambda})$ , $\rho_{e}(\mathbf{x},\bm{\lambda})$ are given by (2), (3), respectively.

Stability Region. Given a placement $\mathbf{x}\in\mathcal{D}$ , a vector of arrival rates $\bm{\lambda}=[\lambda^{r}]_{r\in\mathcal{R}}$ yields a stable (i.e., positive recurrent) system if and only if $\bm{\lambda}\in\Lambda_{\mathbf{x}}$ , where

$$\Lambda_{\mathbf{x}}:=\big\{\bm{\lambda}:\bm{\lambda}\geq\mathbf{0},\ \rho_{e}(\mathbf{x},\bm{\lambda})<1,\ \forall e\in E\big\}\subset\mathbb{R}_{+}^{|\mathcal{R}|}, \tag{5}$$

where loads $\rho_{e}$, $e\in E$, are given by (3). Conversely, given a vector $\bm{\lambda}\in\mathbb{R}_{+}^{|\mathcal{R}|}$,

$$\mathcal{D}_{\bm{\lambda}}=\big\{\mathbf{x}\in\mathcal{D}:\rho_{e}(\mathbf{x},\bm{\lambda})<1,\ \forall e\in E\big\}\subseteq\mathcal{D} \tag{6}$$

is the set of feasible placements under which the system is stable. It is easy to confirm that, by the monotonicity of $\rho_{e}$ w.r.t. $\mathbf{x}$, if $\mathbf{x}\in\mathcal{D}_{\bm{\lambda}}$ and $\mathbf{x}^{\prime}\geq\mathbf{x}$, then $\mathbf{x}^{\prime}\in\mathcal{D}_{\bm{\lambda}}$, where the vector inequality $\mathbf{x}^{\prime}\geq\mathbf{x}$ is component-wise. In particular, if $\mathbf{0}\in\mathcal{D}_{\bm{\lambda}}$ (i.e., the system is stable without caching), then $\mathcal{D}_{\bm{\lambda}}=\mathcal{D}$.

III-B Cache Optimization

Given a Kelly cache network represented by graph $G(V,E)$, service rates $\mu_{e}$, $e\in E$, storage capacities $c_{v}$, $v\in V$, a set of requests $\mathcal{R}$, and arrival rates $\lambda^{r}$, for $r\in\mathcal{R}$, we wish to determine placements $\mathbf{x}\in\mathcal{D}$ that optimize a certain design objective. In particular, we seek placements that are solutions to optimization problems of the following form:

$$\text{MinCost}\qquad\text{Minimize:}\quad C(\mathbf{x})=\sum_{e\in E}C_{e}\big(\rho_{e}(\mathbf{x},\bm{\lambda})\big), \tag{7a}$$

$$\text{subj. to:}\quad\mathbf{x}\in\mathcal{D}_{\bm{\lambda}}, \tag{7b}$$

where $C_{e}:[0,1)\to\mathbb{R}_{+}$ , $e\in E$ , are positive cost functions, $\rho_{e}:\mathcal{D}\times\mathbb{R}_{+}^{|\mathcal{R}|}\to\mathbb{R}_{+}$ is the load on edge $e$ , given by (3), and $\mathcal{D}_{\bm{\lambda}}$ is the set of feasible placements that ensure stability, given by (6). We make the following standing assumption on the cost functions appearing in MinCost:

Assumption 1.

For all $e\in E$ , functions $C_{e}:[0,1)\to\mathbb{R}_{+}$ are convex and non-decreasing on $[0,1)$ .

Assumption 1 is natural; indeed it holds for many cost functions that often arise in practice. We list several examples:

Example 1. Queue Size: Under steady-state distribution (4), the expected number of packets in queue $e\in E$ is given by $\mathbb{E}[n_{e}]=C_{e}(\rho_{e})=\frac{\rho_{e}}{1-\rho_{e}},$ which is indeed convex and non-decreasing for $\rho_{e}\in[0,1)$ . Hence, the expected total number of packets in the system in steady state can indeed be written as the sum of such functions.

Example 2. Delay: From Little’s Theorem [31], the expected delay experienced by a packet in the system is $\mathbb{E}[T]=\frac{1}{\|\bm{\lambda}\|_{1}}\sum_{e\in E}\mathbb{E}[n_{e}],$ where $\|\bm{\lambda}\|_{1}=\sum_{r\in\mathcal{R}}\lambda^{r}$ is the total arrival rate, and $\mathbb{E}[n_{e}]$ is the expected size of each queue. Thus, the expected delay can also be written as the sum of functions that satisfy Assumption 1. We note that the same is true for the sum of the expected delays per queue $e\in E$ , as the latter are given by $\mathbb{E}[T_{e}]=\frac{1}{\lambda_{e}}\mathbb{E}[n_{e}]=\frac{1}{\mu_{e}(1-\rho_{e})},$ which are also convex and non-decreasing in $\rho_{e}$ .

Example 3. Queuing Probability/Load per Edge: In a FIFO queue, the queuing probability is the probability of arriving in a system where the server is busy; this is given by $C_{e}(\rho_{e})=\rho_{e}=\lambda_{e}/\mu_{e}$ , which is again non-decreasing and convex. This is also, of course, the load per edge. By treating $1/\mu_{e}$ as the weight of edge $e\in E$ , this setting recovers the objective of [8] as a special case of our model.

Example 4. Monotone Separable Costs: More generally, consider a state-dependent cost function $c:\Omega\to\mathbb{R}_{+}$ that satisfies the following three properties: (1) it is separable across queues, (2) it depends only on queue sizes $n_{e}$ , and (3) it is non-decreasing w.r.t. these queue sizes. Formally, $c(\mathbf{n})=\sum_{e\in E}c_{e}(n_{e}),$ where $c_{e}:\mathbb{N}\to\mathbb{R}_{+}$ , $e\in E$ , are non-decreasing functions of the queue sizes. Then, the steady state cost under distribution (4) has precisely form (7a) with convex costs, i.e., $\mathbb{E}[c(\mathbf{n})]=\sum_{e\in E}C_{e}(\rho_{e})$ where $C_{e}:[0,1)\to\mathbb{R}_{+}$ satisfy Assumption 1. This follows from the fact that:

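$$C_{e}(\rho_{e})\equiv\mathbb{E}[c_{e}(n)]=c_{e}(0)+\sum_{n=0}^{\infty}\big(c_{e}(n+1)-c_{e}(n)\big)\rho_{e}^{n+1}. \tag{8}$$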

The proof is in Appendix B.
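As a numerical sanity check of Example 4, the expected cost can be computed in two ways under the geometric queue-size distribution $\mathbf{P}[n_{e}=k]=(1-\rho_{e})\rho_{e}^{k}$: directly, and through the tail probabilities $\mathbf{P}(n_{e}>n)=\rho_{e}^{n+1}$ that this distribution implies. The sketch below is illustrative only; the function names and the truncation length `terms` are assumptions, not part of the paper.

```python
def expected_cost_direct(c, rho, terms=10_000):
    """Truncated sum of c(k) * P[n = k] with P[n = k] = (1 - rho) * rho**k."""
    total, prob = 0.0, 1.0 - rho
    for k in range(terms):
        total += c(k) * prob
        prob *= rho
    return total

def expected_cost_tail(c, rho, terms=10_000):
    """Truncated tail-sum form: c(0) + sum_n (c(n+1) - c(n)) * P(n_e > n)."""
    total, tail = float(c(0)), rho  # P(n_e > 0) = rho
    for n in range(terms):
        total += (c(n + 1) - c(n)) * tail
        tail *= rho  # next tail probability, rho**(n + 2)
    return total

rho = 0.5
queue_size = lambda n: n    # Example 1: expected queue size rho / (1 - rho)
quadratic = lambda n: n * n  # another non-decreasing state cost

print(expected_cost_tail(queue_size, rho), rho / (1 - rho))
print(expected_cost_tail(quadratic, rho), expected_cost_direct(quadratic, rho))
```

With $c_{e}(n)=n$ the tail-sum form reproduces $\rho_{e}/(1-\rho_{e})$ from Example 1, and since every increment $c_{e}(n+1)-c_{e}(n)$ is non-negative, the result is a power series in $\rho_{e}$ with non-negative coefficients, hence convex and non-decreasing on $[0,1)$.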

In summary, MinCost captures many natural cost objectives, while Assumption 1 holds for any monotonically increasing cost function that depends only on queue sizes.

IV Submodularity and the Greedy Algorithm

Problem MinCost is NP-hard; this is true even when cost functions $C_{e}$ are linear, and the objective is to minimize the sum of the loads per edge [8, 19]. In what follows, we outline our methodology for solving this problem. It relies on the fact that the objective of MinCost is a supermodular set function; our first main contribution is to show that this property is a direct consequence of Assumption 1.

Cost Supermodularity and Caching Gain. First, observe that the cost function $C$ in MinCost can be naturally expressed as a set function. Indeed, for $S\subseteq V\times\mathcal{C}$, let $\mathbf{x}_{S}\in\{0,1\}^{|V||\mathcal{C}|}$ be the binary vector whose support is $S$ (i.e., its non-zero elements are indexed by $S$). As there is a 1-1 correspondence between a binary vector $\mathbf{x}$ and its support $\mathop{\mathtt{supp}}(\mathbf{x})$, we can interpret $C:\{0,1\}^{|V||\mathcal{C}|}\to\mathbb{R}_{+}$ as a set function $C:2^{V\times\mathcal{C}}\to\mathbb{R}_{+}$ via $C(S)\triangleq C(\mathbf{x}_{S})$. Then, the following theorem holds:

Theorem 1.

Under Assumption 1, $C(S)\triangleq C(\mathbf{x}_{S})$ is non-increasing and supermodular over $\{\mathop{\mathtt{supp}}(\mathbf{x}):\mathbf{x}\in\mathcal{D}_{\bm{\lambda}}\}$ .

A detailed proof of Theorem 1 can be found in Appendix C. In light of the observations in Sec. III-B regarding Assumption 1, Thm. 1 implies that supermodularity arises for a broad array of natural cost objectives, including expected delay and system size; it also applies under the full generality of Kelly networks, including FIFO, LIFO, and round robin service disciplines. Armed with this theorem, we turn our attention to converting MinCost to a submodular maximization problem. In doing so, we face the problem that domain $\mathcal{D}_{\bm{\lambda}}$ , determined not only by storage capacity constraints, but also by stability, may be difficult to characterize. Nevertheless, we show that a problem that is amenable to approximation can be constructed, provided that a placement $\mathbf{x}_{0}\in\mathcal{D}_{\bm{\lambda}}$ is known.

In particular, suppose that we have access to a single $\mathbf{x}_{0}\in\mathcal{D}_{\bm{\lambda}}$ . We define the caching gain $F:\mathcal{D}_{\bm{\lambda}}\to\mathbb{R}_{+}$ as $F(\mathbf{x})=C(\mathbf{x}_{0})-C(\mathbf{x}).$ Note that, for $\mathbf{x}\geq\mathbf{x}_{0}$ , $F(\mathbf{x})$ is the relative decrease in the cost compared to the cost under $\mathbf{x}_{0}$ . We consider the following optimization problem:

$$\text{MaxCG}\qquad\text{Maximize:}\quad F(\mathbf{x})=C(\mathbf{x}_{0})-C(\mathbf{x}) \tag{9a}$$

$$\text{subj. to:}\quad\mathbf{x}\in\mathcal{D},\ \mathbf{x}\geq\mathbf{x}_{0} \tag{9b}$$

Observe that, if $\mathbf{0}\in\mathcal{D}_{\bm{\lambda}}$, then $\mathcal{D}_{\bm{\lambda}}=\mathcal{D}$; in this case, taking $\mathbf{x}_{0}=\mathbf{0}$ ensures that problems MinCost and MaxCG are equivalent. If $\mathbf{x}_{0}\neq\mathbf{0}$, the above formulation attempts to maximize the gain restricted to placements $\mathbf{x}\in\mathcal{D}$ that dominate $\mathbf{x}_{0}$: such placements necessarily satisfy $\mathbf{x}\in\mathcal{D}_{\bm{\lambda}}$. Thm. 1 has the following immediate implication:

Corollary 1.

The caching gain $F(S)\triangleq F(\mathbf{x}_{S})$ is non-decreasing and submodular over $\{\mathop{\mathtt{supp}}(\mathbf{x}):\mathbf{x}\in\mathcal{D}_{\bm{\lambda}}\}$ .
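Corollary 1 can be spot-checked by brute force on a small instance. The sketch below is an illustration, not a proof: it assumes a toy path network $1\to 2\to 3\to 4$ with node 4 as designated server, a common service rate `MU`, the queue-size cost $C_{e}(\rho)=\rho/(1-\rho)$, and $\mathbf{x}_{0}=\mathbf{0}$. All names and numbers are hypothetical, and capacity constraints are ignored so that every subset of the ground set is tested.

```python
from itertools import combinations

# Hypothetical instance: (rate, item, path); node 4 is the designated server.
REQUESTS = [(1.0, "a", [1, 2, 3, 4]), (0.5, "b", [1, 2, 3, 4]), (0.7, "a", [2, 3, 4])]
MU = 4.0  # common service rate, chosen so that every load stays below 1
GROUND = [(v, i) for v in (1, 2, 3) for i in ("a", "b")]

def cost(placement):
    """Sum over edges of rho / (1 - rho), the expected total queue size."""
    rho = {}
    for rate, item, path in REQUESTS:
        survive = 1.0
        for v, u in zip(path, path[1:]):
            survive *= 1.0 - ((v, item) in placement)
            rho[(u, v)] = rho.get((u, v), 0.0) + rate * survive / MU
    return sum(r / (1.0 - r) for r in rho.values())

def gain(placement):
    """Caching gain F relative to the empty placement x_0 = 0."""
    return cost(frozenset()) - cost(placement)

def subsets(items):
    return (frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k))

def is_monotone_submodular(tol=1e-12):
    """Check non-negative, diminishing marginal gains for all A subset of B."""
    for big in subsets(GROUND):
        for small in subsets(sorted(big)):
            for e in GROUND:
                if e in big:
                    continue
                d_small = gain(small | {e}) - gain(small)
                d_big = gain(big | {e}) - gain(big)
                if d_big < -tol or d_small < d_big - tol:
                    return False
    return True

print(is_monotone_submodular())
```

The check enumerates every nested pair $A\subseteq B$ and every element outside $B$, so it only scales to a handful of nodes and items; it is meant as a sanity test of the diminishing-returns property, not as part of any algorithm.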

Greedy Algorithm. Constraints (9b) define a (partition) matroid [1, 19]. This, along with the submodularity and monotonicity of $F$ , implies that the greedy algorithm produces a solution within a $\frac{1}{2}$ -approximation of the optimal solution of MaxCG [32, 28]. The algorithm, summarized in Alg. 1, iteratively allocates items to caches, each time choosing the allocation that yields the largest marginal gain. The approximation guarantee of $\frac{1}{2}$ is tight:

Lemma 1.

For any $\varepsilon>0$ , there exists a cache network for which the solution of the greedy algorithm is within a factor $\frac{1}{2}+\varepsilon$ of the optimal, when the objective is the sum of expected delays per edge.

The proof of Lemma 1 can be found in Appendix D. The instance under which the bound is tight is given in Fig. 2. As we discuss in Sec. VII, the greedy algorithm performs well in practice for some topologies; however, Lemma 1 motivates us to seek alternative algorithms that attain improved approximation guarantees.
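The greedy allocation described above can be sketched as follows. This is a minimal illustration rather than Alg. 1 itself: the function names, the dictionary-based capacity map, and the toy gain `path_gain` are assumptions. The toy gain simply reproduces the gain values quoted for the path instance in Appendix D (an edge contributes its no-caching cost once it no longer carries traffic).

```python
from typing import Callable, Dict, FrozenSet, Sequence, Tuple

Pair = Tuple[str, int]  # (cache node, item); hypothetical encoding


def greedy_placement(
    nodes: Sequence[str],
    items: Sequence[int],
    capacity: Dict[str, int],
    gain: Callable[[FrozenSet[Pair]], float],
) -> FrozenSet[Pair]:
    """Repeatedly add the feasible (node, item) pair with the largest
    strictly positive marginal gain, respecting per-node capacities."""
    placement: set = set()
    used = {v: 0 for v in nodes}
    while True:
        base = gain(frozenset(placement))
        best_pair, best_delta = None, 0.0
        for v in nodes:
            if used[v] >= capacity[v]:
                continue  # cache at v is full
            for i in items:
                if (v, i) in placement:
                    continue
                delta = gain(frozenset(placement | {(v, i)})) - base
                if delta > best_delta:
                    best_pair, best_delta = (v, i), delta
        if best_pair is None:  # no pair improves the gain any further
            return frozenset(placement)
        placement.add(best_pair)
        used[best_pair[0]] += 1


def path_gain(placement, delta: float = 0.5, M: float = 200.0) -> float:
    """Toy gain mirroring the numbers quoted in Appendix D (assumption)."""
    total = 0.0
    if ("u", 1) in placement:
        total += 1.0 / (1.0 - delta)  # edge (u, v) relieved
    if ("u", 2) in placement:
        total += 1.0 / (M - delta) + 1.0 / (1.0 - delta)  # (u, w) and (w, z)
    elif ("w", 2) in placement:
        total += 1.0 / (1.0 - delta)  # only edge (w, z) relieved
    return total


greedy = greedy_placement(["u", "w"], [1, 2], {"u": 1, "w": 1}, path_gain)
optimal = frozenset({("u", 1), ("w", 2)})
ratio = path_gain(greedy) / path_gain(optimal)
```

On this instance the sketch caches item 2 at $u$ and then stops, since no remaining allocation has positive marginal gain; the resulting ratio is close to $1/2$, in line with Lemma 1.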

V Continuous-Greedy Algorithm

The continuous-greedy algorithm by Calinescu et al. [1] attains a stronger guarantee than the greedy algorithm, raising the approximation ratio from $0.5$ to $1-1/e\approx 0.63$ . The algorithm maximizes the so-called multilinear extension of objective $F$ , thereby obtaining a fractional solution $\mathbf{y}$ in the convex hull of the constraint space. The resulting solution is then rounded to produce an integral solution.

V-A Algorithm Overview

Formally, the multilinear extension of the caching gain $F$ is defined as follows. Define the convex hull of the set defined by the constraints (9b) in MaxCG as:

[TABLE]

Intuitively, $\mathbf{y}\in\tilde{\mathcal{D}}$ is a fractional vector in $\mathbb{R}^{|V||\mathcal{C}|}$ satisfying the capacity constraints, and the bound $\mathbf{y}\geq\mathbf{x}_{0}$ .

Given a $\mathbf{y}\in\tilde{\mathcal{D}}$ , consider a random vector $\mathbf{x}$ in $\{0,1\}^{|V||\mathcal{C}|}$ generated as follows: for all $v\in V$ and $i\in\mathcal{C}$ , the coordinates $x_{vi}\in\{0,1\}$ are independent Bernoulli variables such that $\mathbf{P}(x_{vi}=1)=y_{vi}$ . The multilinear extension $G:\tilde{\mathcal{D}}\to\mathbb{R}_{+}$ of $F:\mathcal{D}_{\bm{\lambda}}\to\mathbb{R}_{+}$ is defined via the following expectation $G(\mathbf{y})=\mathbb{E}_{\mathbf{y}}[F(\mathbf{x})]$ , parameterized by $\mathbf{y}\in\tilde{\mathcal{D}}$ , i.e.,

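$$G(\mathbf{y})=\mathbb{E}_{\mathbf{y}}[F(\mathbf{x})]=\sum_{\mathbf{x}\in\{0,1\}^{|V||\mathcal{C}|}}F(\mathbf{x})\prod_{(v,i)\in V\times\mathcal{C}}y_{vi}^{x_{vi}}(1-y_{vi})^{1-x_{vi}}.$$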
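For very small instances, the expectation defining $G$ can be evaluated by brute-force enumeration of all binary vectors. The sketch below does exactly that; the function name and the toy objective `toy_F` are illustrative assumptions and not part of the paper.

```python
from itertools import product
from typing import Callable, Sequence


def multilinear_extension(F: Callable[[Sequence[int]], float],
                          y: Sequence[float]) -> float:
    """Exact E_y[F(x)] for independent Bernoulli coordinates with
    P(x_j = 1) = y_j, obtained by summing over all 2^d binary vectors."""
    total = 0.0
    for x in product((0, 1), repeat=len(y)):
        prob = 1.0
        for xj, yj in zip(x, y):
            prob *= yj if xj == 1 else (1.0 - yj)
        total += prob * F(x)
    return total


def toy_F(x: Sequence[int]) -> float:
    """Hypothetical monotone submodular objective on three coordinates."""
    return 3.0 * (1 - (1 - x[0]) * (1 - x[1])) + 2.0 * x[2]


G_half = multilinear_extension(toy_F, (0.5, 0.5, 0.5))
```

The enumeration has $2^{d}$ terms, which is why it is only usable as a sanity check and not inside the algorithm.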

The continuous-greedy algorithm, summarized in Alg. 2, proceeds by first producing a fractional vector $\mathbf{y}\in\tilde{\mathcal{D}}$ . Starting from $\mathbf{y}_{0}=\mathbf{x}_{0}$ , the algorithm iterates over:

[TABLE]

for an appropriately selected step size $\gamma_{k}\in[0,1]$ . Intuitively, this yields an approximate solution to the non-convex problem:

[TABLE]

Even though (13) is not convex, the output of Alg. 2 is within a $1-1/e$ factor from the optimal solution $\mathbf{y}^{*}\in\tilde{\mathcal{D}}$ of (13). This fractional solution can be rounded to produce a solution to MaxCG with the same approximation guarantee using either the pipage rounding [20] or the swap rounding [1, 33] schemes: we review both in Appendix E.
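The iteration can be sketched as follows under simplifying assumptions that are ours: $\mathbf{x}_{0}=\mathbf{0}$, a constant step size $\gamma=1/K$, only per-node capacity constraints, and a multilinear $G$ supplied as a callable, so that each partial derivative equals $G$ with the coordinate set to 1 minus $G$ with it set to 0. Under these assumptions the linear program reduces to picking, at each node, the $c_{v}$ items with the largest positive partial derivatives. All names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

Pair = Tuple[str, int]  # (node, item)


def continuous_greedy(
    G: Callable[[Dict[Pair, float]], float],
    nodes: Sequence[str],
    items: Sequence[int],
    capacity: Dict[str, int],
    steps: int = 100,
) -> Dict[Pair, float]:
    gamma = 1.0 / steps
    y = {(v, i): 0.0 for v in nodes for i in items}
    for _ in range(steps):
        # Exact partial derivatives of a multilinear G.
        grad = {}
        for key in y:
            hi = dict(y)
            hi[key] = 1.0
            lo = dict(y)
            lo[key] = 0.0
            grad[key] = G(hi) - G(lo)
        # Linear maximization over the capacity polytope: per node,
        # move towards the top-c_v items with positive derivative.
        for v in nodes:
            ranked = sorted(items, key=lambda i: grad[(v, i)], reverse=True)
            for i in ranked[: capacity[v]]:
                if grad[(v, i)] > 0:
                    y[(v, i)] = min(1.0, y[(v, i)] + gamma)
    return y


def toy_G(y: Dict[Pair, float], delta: float = 0.5, M: float = 200.0) -> float:
    """Multilinear form of a toy gain on the path instance of Appendix D
    (an assumption mirroring the gain values quoted there)."""
    A = 1.0 / (1.0 - delta)
    B = 1.0 / (M - delta)
    return A * y[("u", 1)] + (A + B) * y[("u", 2)] + A * (1 - y[("u", 2)]) * y[("w", 2)]


y_frac = continuous_greedy(toy_G, ["u", "w"], [1, 2], {"u": 1, "w": 1})
```

On this toy instance the fractional solution places almost all of the mass at $u$ on item 1 and all of the mass at $w$ on item 2, in contrast to the greedy choice of item 2 at $u$.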

A Sampling-Based Estimator. Function $G$ , given by (11), involves a summation over $2^{|V||\mathcal{C}|}$ terms, and cannot be easily computed in polynomial time. Typically, a sampling-based estimator (see, e.g., [1]) is used instead. Function $G$ is linear when restricted to each coordinate $y_{vi}$ , for some $v\in V$ , $i\in\mathcal{C}$ (i.e., when all inputs except $y_{vi}$ are fixed). As a result, the partial derivative of $G$ w.r.t. $y_{vi}$ can be written as:

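$$\frac{\partial G(\mathbf{y})}{\partial y_{vi}}=\mathbb{E}_{\mathbf{y}}[F(\mathbf{x})\mid x_{vi}=1]-\mathbb{E}_{\mathbf{y}}[F(\mathbf{x})\mid x_{vi}=0]\geq 0,$$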

where the last inequality is due to monotonicity of $F$ . One can thus estimate the gradient by (a) producing $T$ random samples $\mathbf{x}^{(\ell)}$ , $\ell=1,\ldots,T$ of the random vector $\mathbf{x}$ , consisting of independent Bernoulli coordinates, and (b) computing, for each pair $(v,i)\in V\times\mathcal{C}$ , the average

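$$\widehat{\frac{\partial G}{\partial y_{vi}}}=\frac{1}{T}\sum_{\ell=1}^{T}\left(F([\mathbf{x}^{(\ell)}]_{+(v,i)})-F([\mathbf{x}^{(\ell)}]_{-(v,i)})\right),$$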

where $[\mathbf{x}]_{+(v,i)}$ , $[\mathbf{x}]_{-(v,i)}$ are equal to vector $\mathbf{x}$ with the $(v,i)$ -th coordinate set to 1 and 0, respectively. Using this estimate, Alg. 2 attains an approximation ratio arbitrarily close to $1-1/e$ for appropriately chosen $T$ .
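Steps (a) and (b) above can be sketched as follows. The coordinates are indexed by a flat integer $j$ instead of pairs $(v,i)$, and the function names and the toy objective are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def sampled_gradient(F: Callable[[Sequence[int]], float],
                     y: Sequence[float],
                     T: int,
                     rng: random.Random) -> List[float]:
    """Average, over T Bernoulli samples x ~ y, of
    F(x with coordinate j set to 1) - F(x with coordinate j set to 0)."""
    d = len(y)
    grad = [0.0] * d
    for _ in range(T):
        x = [1 if rng.random() < yj else 0 for yj in y]
        for j in range(d):
            x_plus = list(x)
            x_plus[j] = 1
            x_minus = list(x)
            x_minus[j] = 0
            grad[j] += F(x_plus) - F(x_minus)
    return [g / T for g in grad]


def toy_F(x: Sequence[int]) -> float:
    """Hypothetical monotone submodular objective on three coordinates."""
    return 3.0 * (1 - (1 - x[0]) * (1 - x[1])) + 2.0 * x[2]


estimate = sampled_gradient(toy_F, (0.5, 0.5, 0.5), T=2000, rng=random.Random(0))
```

Each call evaluates $F$ a total of $2Td$ times, which is the cost that the deterministic estimator of Sec. V-B avoids.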

In particular, the following theorem holds:

Theorem 2.

[Calinescu et al. [1]] Consider Alg. 2, with $\nabla G(\mathbf{y}_{k})$ replaced by the sampling-based estimate $\widehat{\nabla G(\mathbf{y}_{k})}$ , given by (15). Set $T=\frac{10}{\delta^{2}}(1+\ln(|\mathcal{C}||{V}|))$ , and $\gamma=\delta$ , where $\delta=\frac{1}{40|\mathcal{C}||{V}|\cdot(\sum_{v\in{V}}c_{v})^{2}}.$ Then, the algorithm terminates after $K=1/\gamma=1/\delta$ steps and, with high probability,

[TABLE]

where $\mathbf{y}^{*}$ is an optimal solution to (13).

The proof of the theorem can be found in Appendix A of Calinescu et al. [1] for general submodular functions over arbitrary matroid constraints; we state Thm. 2 here with constants $T$ and $\gamma$ set specifically for our objective $G$ and our set of constraints $\tilde{\mathcal{D}}$ .

Under this parametrization of $T$ and $\gamma$ , Alg. 2 runs in polynomial time. More specifically, note that $1/\delta=O(|\mathcal{C}||{V}|\cdot(\sum_{v\in{V}}c_{v})^{2})$ is polynomial in the input size. Moreover, the algorithm runs for $K=1/\delta$ iterations in total. Each iteration requires $T=O(\frac{1}{\delta^{2}}(1+\ln(|\mathcal{C}||V|)))$ samples, each involving a polynomial computation (as $F$ can be evaluated in polynomial time). Finally, LP (12a) can be solved in polynomial time in the number of constraints and variables, which are $O(|V||\mathcal{C}|)$ .

V-B A Novel Estimator via Taylor Expansion

The classic approach to estimate the gradient via sampling has certain drawbacks. The number of samples $T$ required to attain the $1-1/e$ ratio is quadratic in $|V||\mathcal{C}|$ . In practice, even for networks and catalogs of moderate size (say, $|V|=|\mathcal{C}|=100$ ), the number of samples becomes prohibitive (of the order of $10^{8}$ ). Producing an estimate for $\nabla G$ via a closed form computation that eschews sampling thus has significant computational advantages. In this section, we show that the multilinear relaxation of the caching gain $F$ admits such a closed-form characterization.

We say that a polynomial $f:\mathbb{R}^{d}\to\mathbb{R}$ is in Weighted Disjunctive Normal Form (W-DNF) if it can be written as

$$f(\mathbf{x})=\sum_{s\in\mathcal{S}}\beta_{s}\prod_{j\in I(s)}(1-x_{j}),$$

for some index set $\mathcal{S}$ , positive coefficients $\beta_{s}>0$ , and index sets $I(s)\subseteq\{1,\ldots,d\}$ . Intuitively, treating binary variables $x_{j}\in\{0,1\}$ as boolean values, each W-DNF polynomial can be seen as a weighted sum (disjunction) among products (conjunctions) of negative literals. These polynomials arise naturally in the context of our problem; in particular:

Lemma 2.

For all $k\geq 1$ , $\mathbf{x}\in\mathcal{D}$ , and $e\in E$ , $\rho^{k}_{e}(\mathbf{x},\bm{\lambda})$ is a W-DNF polynomial whose coefficients depend on $\bm{\lambda}$ .

Proof (Sketch).

The lemma holds for $k=1$ by (2) and (3). The lemma follows by induction, as W-DNF polynomials over binary $\mathbf{x}\in\mathcal{D}$ are closed under multiplication; this is because $(1-x)^{\ell}=(1-x)$ for all $\ell\geq 1$ when $x\in\{0,1\}$ . ∎

Hence, all load powers are W-DNF polynomials; a detailed proof can be found in Appendix F. Expectations of W-DNF polynomials have a remarkable property:

Lemma 3.

Let $f:\mathcal{D}_{\bm{\lambda}}\to\mathbb{R}$ be a W-DNF polynomial, and let $\mathbf{x}\in\mathcal{D}$ be a random vector of independent Bernoulli coordinates parameterized by $\mathbf{y}\in\tilde{\mathcal{D}}$ . Then $\mathbb{E}_{\mathbf{y}}[f(\mathbf{x})]=f(\mathbf{y})$ , where $f(\mathbf{y})$ is the evaluation of the W-DNF polynomial representing $f$ over the real vector $\mathbf{y}$ .

Proof.

As $f$ is W-DNF, it can be written as

$$f(\mathbf{x})=\sum_{s\in\mathcal{S}}\beta_{s}\prod_{t\in\mathcal{I}(s)}(1-x_{t}),$$

for appropriate $\mathcal{S}$ , and appropriate $\beta_{s},\mathcal{I}(s)$ , where $s\in\mathcal{S}$ . Hence,

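$$\mathbb{E}_{\mathbf{y}}[f(\mathbf{x})]=\sum_{s\in\mathcal{S}}\beta_{s}\,\mathbb{E}_{\mathbf{y}}\Big[\prod_{t\in\mathcal{I}(s)}(1-x_{t})\Big]=\sum_{s\in\mathcal{S}}\beta_{s}\prod_{t\in\mathcal{I}(s)}\big(1-\mathbb{E}_{\mathbf{y}}[x_{t}]\big)=\sum_{s\in\mathcal{S}}\beta_{s}\prod_{t\in\mathcal{I}(s)}(1-y_{t})=f(\mathbf{y}),$$

where the second equality holds because the coordinates of $\mathbf{x}$ are independent.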

Lemma 3 states that, to compute the expectation of a W-DNF polynomial $f$ over independent Bernoulli variables with expectations $\mathbf{y}$ , it suffices to evaluate $f$ over input $\mathbf{y}$ . Expectations computed this way therefore do not require sampling.
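Lemma 3 is easy to check numerically on a small example. The sketch below represents a W-DNF polynomial as a list of (coefficient, index tuple) pairs, which is an encoding chosen here for illustration, and compares the brute-force expectation with the direct evaluation at $\mathbf{y}$.

```python
from itertools import product
from typing import List, Sequence, Tuple

Term = Tuple[float, Tuple[int, ...]]  # (beta_s, distinct indices I(s))


def wdnf_eval(terms: List[Term], z: Sequence[float]) -> float:
    """Evaluate sum_s beta_s * prod_{j in I(s)} (1 - z_j)."""
    total = 0.0
    for beta, idx in terms:
        prod_ = 1.0
        for j in idx:
            prod_ *= 1.0 - z[j]
        total += beta * prod_
    return total


def wdnf_expectation(terms: List[Term], y: Sequence[float]) -> float:
    """Brute-force E_y[f(x)] over independent Bernoulli coordinates."""
    total = 0.0
    for x in product((0, 1), repeat=len(y)):
        prob = 1.0
        for xj, yj in zip(x, y):
            prob *= yj if xj else 1.0 - yj
        total += prob * wdnf_eval(terms, x)
    return total


terms = [(2.0, (0, 1)), (0.5, (1, 2)), (1.0, (0,))]
y = (0.2, 0.7, 0.4)
by_enumeration = wdnf_expectation(terms, y)
by_evaluation = wdnf_eval(terms, y)
```

The two values agree up to floating-point error, while the direct evaluation needs one pass over the terms instead of $2^{d}$ evaluations.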

We leverage this property to approximate $\nabla G(\bm{y})$ by taking the Taylor expansion of the cost functions $C_{e}$ at each edge $e\in E$ . This allows us to write $C_{e}$ as a power series w.r.t. $\rho_{e}^{k}$ , $k\geq 1$ ; from Lemmas 2 and 3, we can compute the expectation of this series in a closed form. In particular, by expanding the series and rearranging terms it is easy to show the following lemma, which is proved in Appendix G:

Lemma 4.

Consider a cost function $C_{e}:[0,1)\to\mathbb{R}_{+}$ which satisfies Assumption 1 and for which the Taylor expansion exists at some $\rho^{*}\in[0,1)$ . Then, for $\mathbf{x}\in\mathcal{D}$ a random Bernoulli vector parameterized by $\mathbf{y}\in\tilde{\mathcal{D}}$ ,

[TABLE]

where $\alpha^{(k)}_{e}=\sum_{i=k}^{L}\frac{(-1)^{i-k}\binom{i}{k}}{i!}C^{(i)}_{e}(\rho^{*})(\rho^{*})^{i-k}$ for $k=0,1,\ldots,L$ , and the error of the approximation is $\frac{1}{(L+1)!}\sum_{e\in E}C^{(L+1)}_{e}(\rho^{\prime})\left[\mathbb{E}_{[\mathbf{y}]_{-(v,i)}}[(\rho_{e}(\mathbf{x},\bm{\lambda})-\rho^{*})^{L+1}]-\mathbb{E}_{[\mathbf{y}]_{+(v,i)}}[(\rho_{e}(\mathbf{x},\bm{\lambda})-\rho^{*})^{L+1}]\right].$
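The coefficients $\alpha^{(k)}_{e}$ are the result of rewriting the Taylor polynomial in powers of $\rho$ rather than $(\rho-\rho^{*})$. The sketch below computes them from a list of derivatives and uses the queue-size cost $C(\rho)=\rho/(1-\rho)$ as an example, for which $C^{(i)}(\rho)=i!/(1-\rho)^{i+1}$ for $i\geq 1$. The function names are assumptions.

```python
from math import comb, factorial
from typing import List


def taylor_alpha(derivs: List[float], rho_star: float) -> List[float]:
    """derivs[i] = C^{(i)}(rho_star) for i = 0..L. Returns alpha[k] such
    that sum_k alpha[k] * rho**k is the L-th order Taylor polynomial."""
    L = len(derivs) - 1
    return [
        sum(
            (-1) ** (i - k) * comb(i, k) / factorial(i)
            * derivs[i] * rho_star ** (i - k)
            for i in range(k, L + 1)
        )
        for k in range(L + 1)
    ]


def queue_size_derivs(rho_star: float, L: int) -> List[float]:
    """Derivatives of C(rho) = rho / (1 - rho) at rho_star, orders 0..L."""
    out = [rho_star / (1.0 - rho_star)]
    out += [factorial(i) / (1.0 - rho_star) ** (i + 1) for i in range(1, L + 1)]
    return out


rho_star, L, rho = 0.3, 3, 0.5
derivs = queue_size_derivs(rho_star, L)
alpha = taylor_alpha(derivs, rho_star)
via_alpha = sum(a * rho ** k for k, a in enumerate(alpha))
via_taylor = sum(d / factorial(i) * (rho - rho_star) ** i for i, d in enumerate(derivs))
```

Both expressions evaluate the same polynomial; the form in powers of $\rho$ is the one whose expectation can be computed through Lemmas 2 and 3.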

Estimator (17) is deterministic: no random sampling is required. Moreover, Taylor’s theorem allows us to characterize the error (i.e., the bias) of this estimate. We use this to characterize the final fractional solution $\mathbf{y}$ produced by Alg. 2:

Theorem 3.

Assume that all $C_{e}$ , $e\in E$ , satisfy Assumption 1, are $L+1$ -differentiable, and that all their $L+1$ derivatives are bounded by $W\geq 0$ . Then, consider Alg. 2, in which $\nabla G(\mathbf{y}_{k})$ is estimated via the Taylor estimator (17), where each edge cost function is approximated at $\rho_{e}^{*}=\mathbb{E}_{\mathbf{y}_{k}}[\rho_{e}(\mathbf{x},\bm{\lambda})]=\rho_{e}(\mathbf{y}_{k},\bm{\lambda}).$ Then,

[TABLE]

where $K=\frac{1}{\gamma}$ is the number of iterations, $\mathbf{y}^{*}$ is an optimal solution to (13), $D=\max_{\mathbf{y}\in\tilde{\mathcal{D}}}\|\mathbf{y}\|_{2}\leq|V|\cdot\max\limits_{v\in{V}}c_{v},$ is the diameter of $\tilde{\mathcal{D}}$ , $B\leq\frac{W|E|}{(L+1)!}$ is the bias of estimator (17), and $P=2C(\mathbf{x}_{0}),$ is a Lipschitz constant of $\nabla G$ .

The proof can be found in Appendix H. The theorem immediately implies that we can use (17) as the gradient estimator in Alg. 2, and attain an approximation ratio arbitrarily close to $1-1/e$ .

Estimation via Power Series. For arbitrary $L+1$ -differentiable cost functions $C_{e}$ , the estimator (17) can be leveraged by replacing $C_{e}$ with its Taylor expansion. In the case of queue-dependent cost functions, as described in Example 4 of Section III-B, the power-series (8) can be used instead. For example, the expected queue size (Example 1, Sec. III-B), is given by $C_{e}(\rho_{e})=\frac{\rho_{e}}{1-\rho_{e}}=\sum_{k=1}^{\infty}\rho_{e}^{k}.$ In contrast to the Taylor expansion, this power series does not depend on a point $\rho^{*}_{e}$ around which the function $C_{e}$ is approximated.
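For the expected queue size, truncating the power series after $L$ terms leaves a remainder of exactly $\rho_{e}^{L+1}/(1-\rho_{e})$, by the geometric series formula. A short numerical check, with hypothetical function names:

```python
def truncated_queue_size(rho: float, L: int) -> float:
    """First L terms of sum_{k>=1} rho^k, approximating rho / (1 - rho)."""
    return sum(rho ** k for k in range(1, L + 1))


def truncation_error(rho: float, L: int) -> float:
    """Closed-form remainder of the geometric series after L terms."""
    return rho ** (L + 1) / (1.0 - rho)


rho = 0.6
exact = rho / (1.0 - rho)
approx_1 = truncated_queue_size(rho, 1)
approx_2 = truncated_queue_size(rho, 2)
```

The remainder decays geometrically in $L$ but grows as $\rho_{e}$ approaches 1, so low-order truncations are least accurate on heavily loaded edges.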

VI Beyond M/M/1 queues

As discussed in Section III, the class of M/M/1 queues for which supermodularity of the cost functions arises is quite broad, and includes FIFO, LIFO, and processor sharing queues. In this section, we discuss how our results extend to even broader families of queuing networks. Chapter 3 of Kelly [2] provides a general framework for a set of queues for which service times are exponentially distributed; for completeness, we also summarize this in Appendix I. A large class of networks can be modeled by this framework, including networks of M/M/ $k$ queues; all such networks maintain the property that steady-state distributions have a product form. This allows us to extend our results to M/M/ $k$ queues for two cost functions $C_{e}$ :

Lemma 5.

For a network of M/M/k queues, both the queuing probability (given by the so-called Erlang C formula [31]) and the expected queue size are non-increasing and supermodular over sets $\{\mathop{\mathtt{supp}}(\mathbf{x}):\mathbf{x}\in\mathcal{D}_{\bm{\lambda}}\}$ .
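For reference, the standard Erlang C formula gives the probability that an arriving job has to wait in an M/M/$k$ queue with offered load $a=\lambda/\mu$ (so the per-server utilization is $a/k$). The sketch below uses our own symbol and function names.

```python
from math import factorial


def erlang_c(k: int, offered_load: float) -> float:
    """Erlang C queuing probability for an M/M/k queue with
    offered load a = lambda / mu, requiring 0 <= a < k for stability."""
    a = offered_load
    if k < 1 or not 0.0 <= a < k:
        raise ValueError("need k >= 1 and 0 <= offered_load < k")
    tail = (a ** k / factorial(k)) * (k / (k - a))
    head = sum(a ** n / factorial(n) for n in range(k))
    return tail / (head + tail)


p_wait_single = erlang_c(1, 0.3)   # reduces to the load of an M/M/1 queue
p_wait_double = erlang_c(2, 1.0)   # two servers, each 50% utilized
```

For $k=1$ the formula reduces to the load itself, and for fixed $k$ it increases with the offered load.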

We note that, as an immediate consequence of Lemma 5 and Little’s theorem, both the sum of the expected delays per queue and the expected delay of an arriving packet are also supermodular and non-increasing.

Product-form steady-state distributions arise also in settings where service times are not exponentially distributed. A large class of quasi-reversible queues, named symmetric queues, exhibit this property (cf. Section 3.3 of [2] and Chapter 10 of [4]). For completeness, we again summarize symmetric queues in Appendix K. In the following lemma we leverage the product form of symmetric queues to extend our results to M/D/1 symmetric queues [31].

Lemma 6.

For a network of M/D/1 symmetric queues, the expected queue size is non-increasing and supermodular over sets $\{\mathop{\mathtt{supp}}(\mathbf{x}):\mathbf{x}\in\mathcal{D}_{\bm{\lambda}}\}$ .

Again, Lemma 6 and Little’s theorem imply that this property also extends to network delays. It is worth noting that conclusions similar to those in Lemmas 5 and 6 are not possible for all general queues with product form distributions. In particular, we also prove the following negative result:

Lemma 7.

There exists a network of M/M/1/k queues, containing a queue $e$ , for which no strictly monotone function $C_{e}$ of the load $\rho_{e}$ at a queue $e$ is non-increasing and supermodular over sets $\{\mathop{\mathtt{supp}}(\mathbf{x}):\mathbf{x}\in\mathcal{D}_{\bm{\lambda}}\}$ . In particular, the expected size of queue $e$ is neither monotone nor supermodular.

VII Numerical Evaluation

Networks. We execute Algorithms 1 and 2 over 9 network topologies, summarized in Table II. Graphs ER and ER-20Q are the same 100-node Erdős-Rényi graph with parameter $p=0.1$ . Graphs HC and HC-20Q are the same hypercube graph with 128 nodes, and graph star is a star graph with 100 nodes. The graph path is the topology shown in Fig. 2. The last 3 topologies, namely, dtelekom, geant, and abilene represent the Deutsche Telekom, GEANT, and Abilene backbone networks, respectively. The latter is also shown in Fig. 3.

Experimental Setup. For path and abilene, we set demands, storage capacities, and service rates as illustrated in Figures 2 and 3, respectively. Both of these settings induce an approximation ratio close to $1/2$ for greedy. For all remaining topologies, we consider a catalog of size $|\mathcal{C}|$ objects; for each object, we select 1 node uniformly at random (u.a.r.) from $V$ to serve as the designated server for this object. To induce traffic overlaps, we also select $|Q|$ nodes u.a.r. that serve as sources for requests; all requests originate from these sources. All caches are set to the same storage capacity, i.e., $c_{v}=c$ for all $v\in V$ . We generate a set of $|\mathcal{R}|$ possible types of requests. For each request type $r\in\mathcal{R}$ , $\lambda^{r}=1$ request per second, and path $p^{r}$ is generated by selecting a source among the $|Q|$ sources u.a.r., and routing towards the designated server of object $i^{r}$ using a shortest path algorithm. We consider two ways of selecting objects $i^{r}\in\mathcal{C}$ : in the uniform regime, $i^{r}$ is selected u.a.r. from the catalog $\mathcal{C}$ ; in the power-law regime, $i^{r}$ is selected from the catalog $\mathcal{C}$ via a power law distribution with exponent $1.2$ . All the parameter values, e.g., catalog size $|\mathcal{C}|$ , number of requests $|\mathcal{R}|$ , number of query sources $|Q|$ , and caching capacities $c_{v}$ are presented in Table II.

We construct heterogeneous service rates as follows. Every queue service rate is either set to a low value $\mu_{e}=\mu_{\text{low}}$ or a high value $\mu_{e}=\mu_{\text{high}},$ for all $e\in E.$ We select $\mu_{\text{low}}$ and $\mu_{\text{high}}$ as follows. Given the demands $r\in\mathcal{R}$ and the corresponding arrival rates $\lambda^{r}$ , we compute the highest load under no caching ( $\mathbf{x}=\mathbf{0}$ ), i.e., we find $\lambda_{\max}=\max_{e\in E}\sum_{r:e\in p^{r}}\lambda^{r}.$ We then set $\mu_{\text{low}}=\lambda_{\max}\times 1.05$ and $\mu_{\text{high}}=\lambda_{\max}\times 200$ . We set the service rate to $\mu_{\text{low}}$ for all congested edges, i.e., edges $e$ s.t. $\lambda_{e}=\lambda_{\max}$ . We set the service rate for each remaining edge $e\in E$ to $\mu_{\text{low}}$ independently with probability 0.7, and to $\mu_{\text{high}}$ otherwise. Note that, as a result $\textbf{0}\in\mathcal{D}_{\bm{\lambda}}=\mathcal{D}$ , i.e., the system is stable even in the absence of caching and, on average, 30 percent of the edges have a high service rate.
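The construction of the service rates can be sketched as follows. The data layout (paths given as lists of edges, dictionaries keyed by request type) and the function name are assumptions; the constants 1.05, 200, and 0.7 are the ones stated above.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def assign_service_rates(
    paths: Dict[str, List[Edge]],
    rates: Dict[str, float],
    edges: Sequence[Edge],
    rng: random.Random,
    p_low: float = 0.7,
) -> Tuple[Dict[Edge, float], float]:
    """Return (mu, lambda_max) following the low/high construction."""
    # Aggregate arrival rate per edge under no caching (x = 0).
    load = {e: 0.0 for e in edges}
    for r, path in paths.items():
        for e in path:
            load[e] += rates[r]
    lam_max = max(load.values())
    mu_low, mu_high = 1.05 * lam_max, 200.0 * lam_max
    mu = {}
    for e in edges:
        # Congested edges always get mu_low; others get it w.p. p_low.
        if load[e] == lam_max or rng.random() < p_low:
            mu[e] = mu_low
        else:
            mu[e] = mu_high
    return mu, lam_max


edges = [("a", "b"), ("b", "c"), ("c", "d")]
paths = {"r1": [("a", "b"), ("b", "c")], "r2": [("b", "c")], "r3": [("c", "d")]}
rates = {"r1": 1.0, "r2": 1.0, "r3": 1.0}
mu, lam_max = assign_service_rates(paths, rates, edges, random.Random(0))
loads = {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "d"): 1.0}
```

Because even the low rate exceeds $\lambda_{\max}$, every edge has load strictly below 1 without caching, which is what makes $\mathbf{0}\in\mathcal{D}_{\bm{\lambda}}$ in this setup.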

Placement Algorithms. We implement several placement algorithms: (a) Greedy, i.e., the greedy algorithm (Alg. 1), (b) Continuous-Greedy with Random Sampling (CG-RS), i.e., Algorithm 2 with a gradient estimator based on sampling, as described in Sec. V-A, (c) Continuous-Greedy with Taylor approximation (CGT), i.e., Algorithm 2 with a gradient estimator based on the Taylor expansion, as described in Sec. V-B, and (d) Continuous-Greedy with Power Series approximation (CG-PS), i.e., Algorithm 2 with a gradient estimator based on the power series expansion, described also in Sec. V-B. In the case of CG-RS, we collect 500 samples, i.e., $T=500$ . In the case of CG-PS we tried the first and second order expansions of the power series as CG-PS1 and CG-PS2, respectively. In the case of CGT, we tried the first-order expansion $(L=1)$ . In both cases, subsequent to the execution of Alg. 2 we produce an integral solution in $\mathcal{D}$ by rounding via the swap rounding method [33]. All continuous-greedy algorithms use $\gamma=0.001.$ We also implement a random selection algorithm (RND), which caches $c_{v}$ items at each node $v\in V$ , selected uniformly at random. We repeat RND 10 times, and report the average running time and caching gain.

Caching Gain Across Different Topologies. The caching gain $F(\mathbf{x})$ for $\mathbf{x}$ generated by different placement algorithms, is shown for power-law arrival distribution and uniform arrival distribution in Figures 4a and 4b, respectively. The values are normalized by the gains obtained by RND, reported in Table II. Also, the running times of the algorithms for power-law arrival distribution are reported in Fig. 5. As we see in Fig. 4, Greedy is comparable to other algorithms in most topologies. However, for topologies path and abilene Greedy obtains a sub-optimal solution, in comparison to the continuous-greedy algorithm. In fact, for path and abilene Greedy performs even worse than RND. In Fig. 4, we see that the continuous-greedy algorithms with gradient estimators based on Taylor and Power series expansion, i.e., CG-PS1, CG-PS2, and CGT outperform CG-RS500 in most topologies. Also, from Fig. 5, we see that CG-RS500 runs 100 times slower than the continuous-greedy algorithms with first-order gradient estimators, i.e., CG-PS1 and CGT. Note that 500 samples are significantly below the value, stated in Theorem 2, needed to attain the theoretical guarantees of the continuous-greedy algorithm, which is quadratic in $|V||\mathcal{C}|$ .

Varying Service Rates. For topologies path and abilene, the approximation ratio of Greedy is $\approx 0.5.$ This ratio is a function of service rate of the high-bandwidth link $M.$ In this experiment, we explore the effect of varying $M$ on the performance of the algorithms in more detail. We plot the caching gain obtained by different algorithms for path and abilene topologies, using different values of $M\in\{M_{\min},10,20,200\},$ where $M_{\min}$ is the value that puts the system on the brink of instability, i.e., 1 and $2+\epsilon$ for path and abilene, respectively. Thus, we gradually increase the discrepancy between the service rate of low-bandwidth and high-bandwidth links. The corresponding caching gains are plotted in Fig. 6, as a function of $M$ . We see that as $M$ increases the gain attained by Greedy worsens in both topologies: when $M=M_{\min}$ Greedy matches the performance of the continuous-greedy algorithms, in both cases. However, for higher values of $M$ it is beaten not only by all variations of the continuous-greedy algorithm, but by RND as well.

Effect of Congestion on Caching Gain. In this experiment, we study the effect of varying arrival rates $\lambda^{r}$ on caching gain $F$ . We report results only for the dtelekom and ER topologies and power-law arrival distribution. We obtain the cache placements $\mathbf{x}$ using the parameters presented in Table II and different arrival rates: $\lambda^{r}\in\{0.65,0.72,0.81,0.9,1.0\},$ for $r\in\mathcal{R}$ . Fig. 7 shows the caching gain attained by the placement algorithms as a function of arrival rates. We observe that as we increase the arrival rates, the caching gain attained by almost all algorithms, except RND, increases significantly. Moreover, CG-PS1, CG-PS2, CGT, and Greedy have a similar performance, while CG-RS500 achieves lower caching gains.

Varying Caching Capacity. In this experiment, we study the effect of increasing cache capacity $c_{v}$ on the acquired caching gains. Again, we report the results only for the dtelekom and ER topologies and power-law arrival distribution. We evaluate the caching gain obtained by different placement algorithms using the parameters of Table II and different caching capacities: $c_{v}\in\{1,3,10,30\}$ for $v\in V.$ The caching gain is plotted in Fig. 8. As we see, in all cases the obtained gain increases, as we increase the caching capacities. This is expected: caching more items reduces traffic and delay, increasing the gain.

VIII Conclusions

Our analysis suggests feasible object placements targeting many design objectives of interest, including system size and delay, can be determined using combinatorial techniques. Our work leaves the exact characterization of approximable objectives for certain classes of queues, including M/M/1/k queues, open. Our work also leaves open problems relating to stability. This includes the characterization of the stability region of arrival rates $\Lambda=\cup_{\mathbf{x}\in\mathcal{D}}\Lambda(\mathbf{x})$ . It is not clear whether determining membership in this set (or, equivalently, given $\lambda$ , determining whether there exists a $\mathbf{x}\in\mathcal{D}$ under which the system is stable) is NP-hard or not, and whether this region can be somehow approximated. Finally, all algorithms presented in this paper are offline: identifying how to determine placements in an online, distributed fashion, in a manner that attains a design objective (as in [8, 21]), or even stabilizes the system (as in [7]), remains an important open problem.

IX Acknowledgements

The authors gratefully acknowledge support from National Science Foundation grant NeTS-1718355, as well as from research grants by Intel Corp. and Cisco Systems.

Appendix A Kelly Networks

Kelly networks [2, 4, 3] (i.e., multi-class Jackson networks) are networks of queues operating under fairly general service disciplines (including FIFO, LIFO, and processor sharing, to name a few). As illustrated in Fig. 1(a), a Kelly network can be represented by a directed graph $G(V,E)$ , in which each edge is associated with a queue. In the case of First-In First-Out (FIFO) queues, each edge/link $e\in E$ is associated with an M/M/1 queue with service rate $\mu_{e}\geq 0$ . In an open network, packets of exogenous traffic arrive, are routed through consecutive queues, and subsequently exit the network; the path followed by a packet is determined by its class.

Formally, let $\mathcal{R}$ be the set of packet classes. For each packet class $r\in\mathcal{R}$ , we denote by $p^{r}\subseteq V$ the simple path of adjacent nodes visited by a packet. Packets of class $r\in\mathcal{R}$ arrive according to an exogenous Poisson arrival process with rate $\lambda^{r}>0$ , independent of other arrival processes and service times. Upon arrival, a packet travels across nodes in $p^{r}$ , traversing intermediate queues, and exits upon reaching the terminal node in $p^{r}$ .

A Kelly network forms a Markov process over the state space determined by queue contents. In particular, let $n_{e}^{r}$ be the number of packets of class $r\in\mathcal{R}$ in queue $e\in E$ , and $n_{e}=\sum_{r\in\mathcal{R}}n_{e}^{r}$ be the total queue size. The state of a queue $\mathbf{n}_{e}\in\mathcal{R}^{n_{e}}$ , $e\in E$ , is the vector of length $n_{e}$ representing the class of each packet in each position of the queue. The system state is then given by $\mathbf{n}=[\mathbf{n}_{e}]_{e\in E}$ ; we denote by $\Omega$ the state space of this Markov process.

The aggregate arrival rate $\lambda_{e}$ at an edge $e\in E$ is given by $\lambda_{e}=\sum_{r:e\in p^{r}}\lambda^{r}$ , while the load at edge $e\in E$ is given by $\rho_{e}=\lambda_{e}/\mu_{e}$ . Kelly’s extension of Jackson’s theorem [2] states that, if $\rho_{e}<1$ for all $e\in E$ , the Markov process $\{\mathbf{n}(t)\}_{t\geq 0}$ is positive recurrent, and its steady-state distribution has the following product form:

[TABLE]

where

[TABLE]

As a consequence, the queue sizes $n_{e}$ , $e\in E$ , also have a product form distribution in steady state, and their marginals are given by:

[TABLE]
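For reference, the steady-state queue size of a stable M/M/1 queue with load $\rho<1$ is geometric, $\mathbf{P}(n=m)=(1-\rho)\rho^{m}$, with mean $\rho/(1-\rho)$. The sketch below checks this standard fact numerically by truncating the sum; the function names and the truncation level are our own choices.

```python
def mm1_marginal(rho: float, n: int) -> float:
    """P(queue size = n) for a stable M/M/1 queue with load rho < 1."""
    return (1.0 - rho) * rho ** n


def mm1_expected_size(rho: float, n_max: int = 10_000) -> float:
    """Expected queue size, with the infinite sum truncated at n_max."""
    return sum(n * mm1_marginal(rho, n) for n in range(n_max))


rho = 0.6
total_mass = sum(mm1_marginal(rho, n) for n in range(10_000))
mean_size = mm1_expected_size(rho)
```

The mean matches $\rho/(1-\rho)$, which is the expected queue size cost used as an example in Sec. V-B.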

The steady-state distribution (19) holds for many different service disciplines beyond FIFO (cf. Section 3.1 of [2] and Appendix I). In short, incoming packets can be placed in a random position within the queue according to a given distribution, and the (exponentially distributed) service effort can be split across different positions, possibly unequally; both placement and service effort distributions are class-independent. This captures a broad array of policies including FIFO, Last-In First-Out (LIFO), and processor sharing: in all cases, the steady-state distribution is given by (19).

Appendix B Monotone Separable Costs

Consider the state-dependent cost functions $c_{e}:\Omega\to\mathbb{R}_{+}$ introduced in Section III-B. The cost at state $\mathbf{n}\in\Omega$ can be written as $c(\mathbf{n})=\sum_{e\in E}c_{e}(n_{e}).$ Hence $\mathbb{E}[c(\mathbf{n})]=\sum_{e\in E}\mathbb{E}[c_{e}(n_{e})].$ On the other hand, as $c_{e}(n_{e})\geq 0$ , we have that

[TABLE]

As $c_{e}$ is non-decreasing, $c_{e}(n+1)-c_{e}(n)\geq 0$ for all $n\in\mathbb{N}$ . On the other hand, for all $n\in\mathbb{N}$ , $\rho^{n}$ is a convex non-decreasing function of $\rho$ in $[0,1)$ , so $\mathbb{E}[c_{e}(n_{e})]$ is a convex function of $\rho$ as a positively weighted sum of convex non-decreasing functions. ∎

Appendix C Proof of Theorem 1

We first prove the following auxiliary lemma:

Lemma 8.

Let $f:\mathbb{R}\to\mathbb{R}$ be a convex and non-decreasing function. Also, let $g:\mathcal{X}\to\mathbb{R}$ be a non-increasing supermodular set function. Then $h(\mathbf{x})\triangleq f(g(\mathbf{x}))$ is also supermodular.

Proof.

Since $g$ is non-increasing, for any $\mathbf{x},\mathbf{x}^{\prime}\subseteq\mathcal{X}$ we have

[TABLE]

Due to supermodularity of $g$ , we can find $\alpha,\alpha^{\prime}\in[0,1]$ , $\alpha+\alpha^{\prime}\leq 1$ such that

[TABLE]

Then, we have

[TABLE]

where the first inequality is due to convexity of $f$ , and the second one is because $\alpha+\alpha^{\prime}\leq 1$ and $f(g(\cdot))$ is non-increasing. This proves $h(\mathbf{x})\triangleq f(g(\mathbf{x}))$ is supermodular. ∎

To conclude the proof of Thm. 1, observe that it is easy to verify that $\rho_{e}$ is supermodular and non-increasing in $S$ , for all $e\in E$ . Since, by Assumption 1, $C_{e}$ is a non-decreasing function, $C_{e}(S)\triangleq C_{e}(\rho_{e}(S))$ is non-increasing. By Lemma 8, $C_{e}(S)$ is also supermodular. Hence, the cost function is non-increasing and supermodular, as a sum of non-increasing and supermodular functions.
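The argument can be sanity-checked by brute force on a tiny ground set. The sketch below uses a hypothetical load of the W-DNF type (a term contributes only while none of its indices is in $S$) and the queue-size cost $\rho/(1-\rho)$; the terms, names, and tolerance are assumptions.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterator

GROUND = frozenset({0, 1, 2})
# (coefficient, index set): the term is active while S misses the index set.
TERMS = [(0.4, frozenset({0, 1})), (0.3, frozenset({1, 2})), (0.2, frozenset({0}))]


def subsets(ground: FrozenSet[int]) -> Iterator[FrozenSet[int]]:
    elems = sorted(ground)
    for r in range(len(elems) + 1):
        for combo in combinations(elems, r):
            yield frozenset(combo)


def load(S: FrozenSet[int]) -> float:
    """Toy edge load rho(S) with unit service rate; rho(empty) = 0.9 < 1."""
    return sum(beta for beta, idx in TERMS if not (idx & S))


def cost(S: FrozenSet[int]) -> float:
    """Convex, non-decreasing cost of the load: rho / (1 - rho)."""
    rho = load(S)
    return rho / (1.0 - rho)


def is_supermodular(h: Callable[[FrozenSet[int]], float],
                    ground: FrozenSet[int], tol: float = 1e-12) -> bool:
    """Check h(A + j) - h(A) <= h(B + j) - h(B) for all A <= B, j not in B."""
    for B in subsets(ground):
        for A in subsets(B):
            for j in ground - B:
                if h(A | {j}) - h(A) > h(B | {j}) - h(B) + tol:
                    return False
    return True


load_ok = is_supermodular(load, GROUND)
cost_ok = is_supermodular(cost, GROUND)
```

Both checks pass on this instance, whereas the negated cost fails the same test, as expected of a submodular function that is not modular.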

Appendix D Proof of Lemma 1

Consider the path topology illustrated in Fig. 2. Assume that requests for files 1 and 2 are generated at node $u$ with rates $\lambda_{1}=\lambda_{2}=\delta$ , for some $\delta\in(0,1)$ . Files 1 and 2 are stored permanently at $v$ and $z$ , respectively. Caches exist only on $u$ and $w$ , and have capacity $c_{u}=c_{w}=1$ . Edges $(u,v)$ , $(w,z)$ have bandwidth $\mu_{(u,v)}=\mu_{(w,z)}=1$ , while edge $(u,w)$ is a high bandwidth link, having capacity $M\gg 1$ . Let $\mathbf{x}_{0}=\mathbf{0}$ . The greedy algorithm starts from empty caches and adds item 2 at cache $u$ . This is because the caching gain from this placement is $c_{(u,w)}+c_{(w,z)}=\frac{1}{M-\delta}+\frac{1}{1-\delta}$ , while the caching gain of all other decisions is at most $\frac{1}{1-\delta}$ . Any subsequent caching decisions do not change the caching gain. The optimal solution is to cache item 1 at $u$ and item 2 at $w$ , yielding a caching gain of $2/(1-\delta)$ . Hence, the greedy solution attains an approximation ratio $0.5\cdot(1+\frac{1-\delta}{M-\delta}).$ By appropriately choosing $M$ and $\delta$ , this can be made arbitrarily close to 0.5. ∎
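The ratio derived above is easy to evaluate numerically. The sketch below, with a hypothetical function name, computes it directly from the two gains stated in the proof.

```python
def greedy_ratio(M: float, delta: float) -> float:
    """Greedy gain over optimal gain on the path instance of Fig. 2."""
    greedy_gain = 1.0 / (M - delta) + 1.0 / (1.0 - delta)
    optimal_gain = 2.0 / (1.0 - delta)
    return greedy_gain / optimal_gain


ratios = [greedy_ratio(M, 0.5) for M in (2.0, 10.0, 200.0, 1e6)]
```

For fixed $\delta$ the ratio decreases towards $0.5$ as $M$ grows, matching the closed form $0.5\cdot(1+\frac{1-\delta}{M-\delta})$.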

Appendix E Rounding

Several poly-time algorithms can be used to round the fractional solution that is produced by Alg. 2 to an integral $\mathbf{x}\in\mathcal{D}$ . We briefly review two such rounding algorithms: pipage rounding [20], which is deterministic, and swap-rounding [33], which is randomized. For a more rigorous treatment, we refer the reader to [20, 8] for pipage rounding, and [33] for swap rounding.

Pipage rounding uses the following property of $G$ : given a fractional solution $\mathbf{y}\in\tilde{\mathcal{D}}$ , there are at least two fractional variables $y_{vi}$ and $y_{v^{\prime}i^{\prime}}$ , such that transferring mass from one to the other (1) makes at least one of them 0 or 1, (2) keeps the new $\hat{\mathbf{y}}$ feasible in $\tilde{\mathcal{D}}$ , and (3) ensures $G(\hat{\mathbf{y}})\geq G(\mathbf{y})$ , that is, the expected caching gain at $\hat{\mathbf{y}}$ is at least as good as at $\mathbf{y}$ . This process is repeated until $\hat{\mathbf{y}}$ does not have any fractional element, at which point pipage rounding terminates and returns $\hat{\mathbf{y}}$ . This procedure has a run-time of $O(|V||\mathcal{C}|)$ [8], and since each rounding step can only increase $G$ , it follows that the final integral $\hat{\mathbf{y}}\in\mathcal{D}$ must satisfy

[TABLE]

where $\mathbf{x}^{*}$ is an optimal solution to MaxCG. Here, the first equality holds because $F$ and $G$ are equal when their arguments are integral, while the last inequality holds because (13) is a relaxation of MaxCG, maximizing the same objective over a larger domain.

In swap rounding, given a fractional solution $\mathbf{y}\in\tilde{\mathcal{D}}$ produced by Alg. 2, observe that it can be written as a convex combination of integral vectors in $\mathcal{D}$ , i.e., $\mathbf{y}=\sum_{k=1}^{K}\gamma_{k}\mathbf{m}_{k},$ where $\gamma_{k}\in[0,1],\sum_{k=1}^{K}\gamma_{k}=1,$ and $\mathbf{m}_{k}\in\mathcal{D}$ . Moreover, by construction, each such vector $\mathbf{m}_{k}$ is maximal, in that all capacity constraints are tight. Swap rounding iteratively merges these constituent integral vectors, producing an integral solution. At each iteration $k$ , the present integral vector $\mathbf{c}_{k}$ is merged with $\mathbf{m}_{k+1}\in\mathcal{D}$ into a new integral solution $\mathbf{c}_{k+1}\in\mathcal{D}$ as follows: if the two solutions $\mathbf{c}_{k}$ , $\mathbf{m}_{k+1}$ differ at a cache $v\in V$ , items in this cache are swapped to reduce the set difference: either an item $i$ in a cache in $\mathbf{c}_{k}$ replaces an item $j$ in $\mathbf{m}_{k+1}$ , or an item $j$ in $\mathbf{m}_{k+1}$ replaces an item $i$ in $\mathbf{c}_{k}$ ; the former occurs with probability proportional to $\sum_{\ell=1}^{k}\gamma_{\ell}$ , and the latter with probability proportional to $\gamma_{k+1}$ . The swapping is repeated until the two integer solutions become identical; this merged solution becomes $\mathbf{c}_{k+1}$ . This process terminates after $K-1$ steps, after which all the points $\mathbf{m}_{k}$ are merged into a single integral vector $\mathbf{c}_{K}\in\mathcal{D}.$ Observe that, in contrast to pipage rounding, swap rounding does not require any evaluation of the objective $F$ during rounding. This makes swap rounding significantly faster to implement; this comes at the expense of the approximation ratio, however, as the resulting guarantee $1-1/e$ holds only in expectation.

Appendix F Proof of Lemma 2

We prove this by induction on $k\geq 1$ . Observe first that, by (3), the load on each edge $e=(u,v)\in E$ can be written as a polynomial of the following form:

[TABLE]

for appropriately defined

[TABLE]

In other words, $\rho_{e}:\mathcal{D}_{\bm{\lambda}}\to\mathbb{R}$ is indeed a W-DNF polynomial. For the induction step, observe that W-DNF polynomials, seen as functions over the integral domain $\mathcal{D}_{\bm{\lambda}}$ , are closed under multiplication. In particular, the following lemma holds:

Lemma 9.

Given two W-DNF polynomials $f_{1}:\mathcal{D}_{\bm{\lambda}}\to\mathbb{R}$ and $f_{2}:\mathcal{D}_{\bm{\lambda}}\to\mathbb{R}$ , given by

[TABLE]

their product $f_{1}\cdot f_{2}$ is also a W-DNF polynomial over $\mathcal{D}_{\bm{\lambda}}$ , given by:

[TABLE]

Proof.

To see this, observe that

[TABLE]

where $\triangle$ is the symmetric set difference. On the other hand, as $(1-x_{t})\in\{0,1\}$ , we have that $(1-x_{t})^{2}=(1-x_{t})$ , and the lemma follows. ∎

Hence, if $\rho_{e}^{k}(\mathbf{x},\bm{\lambda})$ is a W-DNF polynomial, by (23) and Lemma 9, so is $\rho_{e}^{k+1}(\mathbf{x},\bm{\lambda})$. ∎
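The closure property in Lemma 9 can be checked numerically with a small sketch. The representation below is an assumption made for illustration: a W-DNF polynomial $\sum_{s}\beta_{s}\prod_{t\in\mathcal{I}(s)}(1-x_{t})$ is stored as a dictionary mapping each index set to its coefficient, and the names `evaluate` and `multiply` are hypothetical. The product uses the union of index sets, which is valid only on binary inputs, where $(1-x_{t})^{2}=(1-x_{t})$.

```python
from itertools import product


def evaluate(poly, x):
    """Evaluate sum_s beta_s * prod_{t in I(s)} (1 - x[t])."""
    total = 0.0
    for index_set, beta in poly.items():
        term = beta
        for t in index_set:
            term *= 1 - x[t]
        total += term
    return total


def multiply(f1, f2):
    """Product of two W-DNF polynomials, valid over binary x only."""
    out = {}
    for i1, b1 in f1.items():
        for i2, b2 in f2.items():
            key = i1 | i2  # shared factors collapse since (1-x)^2 = (1-x)
            out[key] = out.get(key, 0.0) + b1 * b2
    return out


if __name__ == "__main__":
    f1 = {frozenset({0}): 2.0, frozenset({1, 2}): 0.5}
    f2 = {frozenset({0, 1}): 3.0, frozenset(): 1.0}
    f12 = multiply(f1, f2)
    for x in product([0, 1], repeat=3):
        print(x, evaluate(f12, x), evaluate(f1, x) * evaluate(f2, x))
```

On fractional inputs the two sides generally differ, which is why the lemma is stated over the integral domain.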

Appendix G Proof of Lemma 4

The Taylor expansion of $C_{e}$ at $\rho^{*}$ is given by:

[TABLE]

where $\rho^{\prime}\in[\rho^{*},\rho]$ and $C_{e}^{(k)}$ is the $k$ -th order derivative of $C_{e}$ . By expanding this polynomial and reorganizing the terms, we get

[TABLE]

where

[TABLE]

for $k=0,1,\cdots,L.$ Consider now the $L$ -th order Taylor approximation of $C_{e}$ , given by

[TABLE]

Clearly, this is an estimator of $C_{e}$, with an error of the order $|C_{e}(\rho)-\hat{C}_{e}(\rho)|=o\left((\rho-\rho^{*})^{L}\right).$ Thus, for $\mathbf{x}\in\mathcal{D}$ a random Bernoulli vector parameterized by $\mathbf{y}\in\tilde{\mathcal{D}}$,

[TABLE]

On the other hand, for all $v\in V$ and $i\in\mathcal{C}$ :

[TABLE]

where the error of the approximation is given by

[TABLE]

The lemma thus follows from Lemmas 2 and 3.
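To make the truncation error concrete, the following sketch evaluates the $L$-th order Taylor approximation for one specific cost, the M/M/1 expected queue size $C_{e}(\rho)=\rho/(1-\rho)$. The choice of cost and the function names are assumptions made for illustration; for this cost the derivatives are $C_{e}^{(k)}(\rho)=k!/(1-\rho)^{k+1}$ for $k\geq 1$.

```python
from math import factorial


def mm1_cost(rho):
    """Expected queue size of an M/M/1 queue with load rho in [0, 1)."""
    return rho / (1.0 - rho)


def mm1_derivative(rho, k):
    """k-th derivative of rho / (1 - rho)."""
    if k == 0:
        return mm1_cost(rho)
    return factorial(k) / (1.0 - rho) ** (k + 1)


def taylor_cost(rho, rho_star, order):
    """Taylor approximation of the cost around rho_star, truncated at `order`."""
    return sum(
        mm1_derivative(rho_star, k) / factorial(k) * (rho - rho_star) ** k
        for k in range(order + 1)
    )


if __name__ == "__main__":
    rho, rho_star = 0.5, 0.4
    for order in range(6):
        err = abs(mm1_cost(rho) - taylor_cost(rho, rho_star, order))
        print(order, err)
```

For this cost the series converges when $|\rho-\rho^{*}|<1-\rho^{*}$, and the error shrinks geometrically in $L$.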

Appendix H Proof of Theorem 3

We begin by bounding the bias of the Taylor-approximation estimator. Given a set of continuous functions $\{C_{(u,v)}\}_{(u,v)\in E}$ whose first $L+1$ derivatives within their operating regime, $[0,1)$, are upper-bounded by a finite constant $W$, the bias of the estimator $\mathbf{z}\equiv[z_{vi}]_{v\in V,i\in\mathcal{C}}$, where $z_{vi}$ is defined by (17), is given by

[TABLE]

where $\rho^{\prime}_{e}\in[\rho^{*}_{e},\rho_{e}]$ . To compute the bias, we note that $\rho_{e},\rho^{*}_{e}\in[0,1]$ . Specifically, we assume $\rho_{e},\rho^{*}_{e}\in[0,1)$ . Hence, $|\rho_{e}-\rho^{*}_{e}|\leq 1$ , and $C^{(L+1)}_{e}(\rho^{\prime}_{e})\leq\max\{C^{(L+1)}_{e}(\rho_{e}),C^{(L+1)}_{e}(\rho^{*}_{e})\}<\infty$ . In particular, let $W=\max_{e\in E}C^{(L+1)}_{e}(\rho^{\prime}_{e})$ . Then, it is easy to compute the following upper bound on the bias of $\mathbf{z}$ :

[TABLE]

In addition, note that $G$ is linear in $y_{vi}$ , and hence [1]:

[TABLE]

which is $\geq 0$ due to monotonicity of $F(\mathbf{x})$ . It is easy to verify that $\frac{\partial^{2}G}{\partial y_{vi}^{2}}=0$ . For $(v_{1},i_{1})\neq(v_{2},i_{2})$ , we can compute the second derivative of $G$ [1] as given by

[TABLE]

which is $\leq 0$ due to the supermodularity of $C(\mathbf{x})$ . Hence, $G(\mathbf{y})$ is component-wise concave [1] .

In addition, it is easy to see that for $\mathbf{y}\in\tilde{\mathcal{D}}$, $||G(\mathbf{y})||$, $||\triangledown G(\mathbf{y})||$, and $||\triangledown^{2}G(\mathbf{y})||$ are bounded by $C(\mathbf{x}_{0})$, $C(\mathbf{x}_{0})$ and $2C(\mathbf{x}_{0})$, respectively. Consequently, $G$ and $\triangledown G$ are $P$-Lipschitz continuous, with $P=2C(\mathbf{x}_{0})$.

In the $k$-th iteration of the Continuous Greedy algorithm, let $\mathbf{m}^{*}=\mathbf{m}^{*}(\mathbf{y}_{k}):=(\mathbf{y}^{*}\vee(\mathbf{y}_{k}+\mathbf{y}_{0}))-\mathbf{y}_{k}=(\mathbf{y}^{*}-\mathbf{y}_{k})\vee\mathbf{y}_{0}\geq\mathbf{y}_{0}$, where $x\vee y:=(\max\{x_{i},y_{i}\})_{i}$. Since $\mathbf{m}^{*}\leq\mathbf{y}^{*}$ and $\mathcal{D}$ is down-closed, $\mathbf{m}^{*}\in\mathcal{D}$. Due to the monotonicity of $G$, it follows that

[TABLE]

We introduce the univariate auxiliary function $g_{\mathbf{y},\mathbf{m}}(\xi):=G(\mathbf{y}+\xi\mathbf{m}),\xi\in[0,1],\mathbf{m}\in\tilde{\mathcal{D}}$. Since $G(\mathbf{y})$ is component-wise concave, $g_{\mathbf{y},\mathbf{m}}(\xi)$ is concave in $[0,1]$. In particular, since $g_{\mathbf{y}_{k},\mathbf{m}^{*}}(\xi)=G(\mathbf{y}_{k}+\xi\mathbf{m}^{*})$ is concave for $\xi\in[0,1]$, it follows that

[TABLE]

Now let $\mathbf{m}_{k}$ be the vector chosen by Algorithm 2 in the $k$ th iteration. We have

[TABLE]

For the LHS, we have

[TABLE]

where $D=\max_{\mathbf{m}\in\tilde{\mathcal{D}}}\|\mathbf{m}\|_{2}\leq|V|\cdot\max\limits_{v\in{V}}c_{v}$ is an upper bound on the diameter of $\tilde{\mathcal{D}}$, $B$ is as defined in (27), and (i) follows from the Cauchy-Schwarz inequality. Similarly, for the RHS of (31), we have

[TABLE]

It follows

[TABLE]

where $(a)$ follows from (30), and $(b)$ follows from (29).

Using the $P$ -Lipschitz continuity property of $\frac{dg_{\mathbf{y}_{k},\mathbf{m}_{k}}(\xi)}{d\xi}$ (due to $P$ -Lipschitz continuity of $\triangledown G$ ), it is straightforward to see that

[TABLE]

hence,

[TABLE]

where $(c)$ follows from (34). By rearranging the terms and letting $k=K-1$, we have

[TABLE]

where $(e)$ is true since $1-x\leq e^{-x},\forall x\geq 0$ , and $G(\mathbf{y}_{0})\leq G(\mathbf{y}^{*})$ holds due to the greedy nature of Algorithm 2 and monotonicity of $G$ . In addition, Algorithm 2 ensures $\sum_{j=0}^{K-1}\gamma_{j}=1$ . It follows

[TABLE]

This result holds for general stepsizes $0<\gamma_{j}\leq 1$ . The RHS of (37) is indeed maximized when $\gamma_{j}=\frac{1}{K}$ , which is the assumed case in Algorithm 2. In addition, we have $\mathbf{y}_{0}=\mathbf{0}$ , and hence, $G(\mathbf{y}_{0})=0$ . Therefore, we have

[TABLE]
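The role of the inequality $1-x\leq e^{-x}$ and of the step sizes can be illustrated numerically. The sketch below rests on two assumptions that are not taken from the lost displays: the bias and Lipschitz error terms are ignored, so each iteration is taken to shrink the optimality gap by a factor $(1-\gamma_{j})$, and the discretization error is taken to scale with $\sum_{j}\gamma_{j}^{2}$, as is common in analyses of this kind. The function names are hypothetical.

```python
import math


def guaranteed_fraction(gammas):
    """Fraction of the optimum retained if each step shrinks the gap by (1 - gamma)."""
    remaining = 1.0
    for g in gammas:
        remaining *= 1.0 - g
    return 1.0 - remaining


def squared_step_sum(gammas):
    """Proxy for the discretization error, assumed proportional to sum of gamma^2."""
    return sum(g * g for g in gammas)


if __name__ == "__main__":
    for K in (1, 2, 10, 100, 1000):
        uniform = [1.0 / K] * K
        print(K, guaranteed_fraction(uniform), squared_step_sum(uniform))
    print("limit:", 1.0 - 1.0 / math.e)
```

Under these assumptions, any step sizes summing to one retain at least a $1-1/e$ fraction, while uniform steps $\gamma_{j}=1/K$ give the smallest value of $\sum_{j}\gamma_{j}^{2}$.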

Appendix I General Kelly Networks

In Kelly’s network of queues (see Section 3.1 of [2] for more information), queue $e\in\{1,2,\cdots,|E|\}$ , assuming it contains $n_{e}$ packets in the queue, operates in the following manner:

1. Each packet (customer) requires an exponentially distributed amount of service.

2. A total service effort is provided by queue $e$ at the rate $\mu_{e}(n_{e})$.

3. The packet in position $l$ in the queue is provided with a portion $\gamma_{e}(l,n_{e})$ of the total service effort, for $l=1,2,\cdots,n_{e}$; when this packet completes service and leaves the queue, packets in positions $l+1,l+2,\cdots,n_{e}$ move down to positions $l,l+1,\cdots,n_{e}-1$, respectively.

4. An arriving packet at queue $e$ moves into position $l$, for $l=1,2,\cdots,n_{e}+1$, with probability $\delta_{e}(l,n_{e}+1)$; packets that were in positions $l,l+1,\cdots,n_{e}$ move up to positions $l+1,l+2,\cdots,n_{e}+1$, respectively.

Clearly, we require $\mu_{e}(n_{e})>0$ for $n_{e}>0$ ; in addition,

[TABLE]

Kelly’s theorem [2] states that, if $\rho_{e}<1$ for all $e\in E$, the state of queue $e$ in equilibrium is independent of the rest of the system; hence, the joint steady-state distribution has a product form. In addition, the probability that queue $e$ contains $n_{e}$ packets is

[TABLE]

where $b_{e}$ is the normalizing factor. As can be seen from (41), the steady-state distribution is not a function of the $\gamma_{e}$’s and $\delta_{e}(l,n_{e}+1)$’s and hence is independent of the packet placement and service allocation distributions.

We note that by setting $\mu_{e}(l)=\mu_{e}$ for all $l$, we obtain the results in (21).
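The following sketch computes the per-queue distribution under the assumption that it takes the standard form $\pi_{e}(n)=b_{e}\,a_{e}^{n}/\prod_{l=1}^{n}\mu_{e}(l)$, where $a_{e}$ denotes the total arrival rate at queue $e$. Both this form and the names used are assumptions for illustration, and the infinite state space is truncated at `n_max`.

```python
import math


def queue_distribution(arrival_rate, service_rate, n_max):
    """Truncated distribution pi(n) proportional to a^n / prod_{l<=n} mu(l).

    service_rate is a function l -> mu(l), the total service effort
    when l packets are present.
    """
    weights = [1.0]
    for n in range(1, n_max + 1):
        weights.append(weights[-1] * arrival_rate / service_rate(n))
    total = sum(weights)  # plays the role of 1 / b_e
    return [w / total for w in weights]


if __name__ == "__main__":
    # Constant service rate: geometric distribution (1 - rho) * rho^n.
    pi = queue_distribution(1.0, lambda l: 2.0, n_max=200)
    print(pi[:4])
    # Service rate proportional to occupancy: Poisson distribution.
    pi_inf = queue_distribution(1.0, lambda l: float(l), n_max=50)
    print(pi_inf[0], math.exp(-1.0))
```

The function takes neither $\gamma_{e}$ nor $\delta_{e}$ as input, mirroring the observation that the steady-state distribution does not depend on them.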

Appendix J Proof of Lemma 5

For an arbitrary network of M/M/k queues, the traffic load on queue $(u,v)\in{E}$ is given as

[TABLE]

which is similar to that of M/M/1 queues, but normalized by the number of servers, $k$. Hence, $a_{(u,v)}(\mathbf{x})$ is submodular in $\mathbf{x}$. For an M/M/k queue, the probability that an arriving packet finds all servers busy and is forced to wait in the queue is given by the Erlang C formula [31]:

[TABLE]

where

[TABLE]

is the normalizing factor. In addition, the expected number of packets waiting for or under transmission is given by

[TABLE]

Lee and Cohen [34] show that $P_{(u,v)}^{Q}(\mathbf{x})$ and $\mathbb{E}[n_{(u,v)}(\mathbf{x})]$ are strictly increasing and convex in $a_{(u,v)}(\mathbf{x})$, for $a_{(u,v)}(\mathbf{x})\in[0,1)$. In addition, a more direct proof of the convexity of $\mathbb{E}[n_{(u,v)}(\mathbf{x})]$ was given by Grassmann [35]. Hence, both $P(\mathbf{x}):=\sum_{(u,v)\in{E}}P_{(u,v)}^{Q}(\mathbf{x})$ and $N(\mathbf{x}):=\sum_{(u,v)\in{E}}\mathbb{E}[n_{(u,v)}(\mathbf{x})]$ are sums of increasing and convex functions of the loads. By Theorem 1, both functions are non-increasing and supermodular in $\mathbf{x}$, and the proof is complete.
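The monotonicity and convexity properties cited above can be checked numerically with the textbook Erlang C formula. The sketch below is an illustration using the standard expressions, with the per-server load $a\in[0,1)$ as input and offered load $ka$; the function names are hypothetical and the code does not reproduce the lost displays.

```python
from math import factorial


def erlang_c(load, k):
    """Probability that an arrival must wait in an M/M/k queue.

    load is the per-server utilization a in [0, 1); k is the number of servers.
    """
    offered = load * k  # offered load in erlangs
    tail = offered ** k / factorial(k) / (1.0 - load)
    norm = sum(offered ** n / factorial(n) for n in range(k)) + tail
    return tail / norm


def mean_in_system(load, k):
    """Expected number of packets waiting or in service in an M/M/k queue."""
    return load * k + erlang_c(load, k) * load / (1.0 - load)


def is_increasing_and_convex(f, grid):
    """Check positive first differences and non-negative second differences."""
    values = [f(a) for a in grid]
    first = [b - a for a, b in zip(values, values[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return all(d > 0 for d in first) and all(d >= -1e-12 for d in second)


if __name__ == "__main__":
    grid = [0.05 * i for i in range(1, 19)]  # loads 0.05, ..., 0.90
    print(is_increasing_and_convex(lambda a: erlang_c(a, 3), grid))
    print(is_increasing_and_convex(lambda a: mean_in_system(a, 3), grid))
```

With $k=1$ the waiting probability reduces to the load itself and the mean number in system reduces to $a/(1-a)$, matching the M/M/1 case.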

Appendix K Networks of Symmetric Queues

Let $n_{e}$ be the number of packets in queue $e\in E$, placed in positions $1,2,\cdots,n_{e}$. Queue $e$ is defined as a symmetric queue if it operates in the following manner:

1. The service requirement of a packet is a random variable whose distribution may depend upon the class of the packet.

2. A total service effort is provided by queue $e$ at the rate $\mu_{e}(n_{e})$.

3. The packet in position $l$ in the queue is provided with a portion $\gamma_{e}(l,n_{e})$ of the total service effort, for $l=1,2,\cdots,n_{e}$; when this packet completes service and leaves the queue, packets in positions $l+1,l+2,\cdots,n_{e}$ move down to positions $l,l+1,\cdots,n_{e}-1$, respectively.

4. An arriving packet at queue $e$ moves into position $l$, for $l=1,2,\cdots,n_{e}+1$, with probability $\gamma_{e}(l,n_{e}+1)$; packets that were in positions $l,l+1,\cdots,n_{e}$ move up to positions $l+1,l+2,\cdots,n_{e}+1$, respectively.

Similarly, we require $\mu_{e}(n_{e})>0$ for $n_{e}>0$ ; in addition,

[TABLE]

As shown in [2] and [4], symmetric queues have product-form steady-state distributions. In particular, the probability that there are $n_{e}$ packets in queue $e$ is similar to that given by (41).

Appendix L Proof of Lemma 6

Let $\rho_{(u,v)}(\mathbf{x})$ be the traffic load on queue $(u,v)\in E$, as defined by (3). It can be shown that the average number of packets in queue $(u,v)\in E$ is of the form [31]

[TABLE]

It is easy to see that this function is strictly increasing and convex in $\rho_{(u,v)}(\mathbf{x})$ for $\rho_{(u,v)}(\mathbf{x})\in[0,1)$ . Due to Theorem 1, $N(\mathbf{x}):=\sum_{(u,v)\in E}\mathbb{E}[n_{(u,v)}(\mathbf{x})]$ is non-increasing and supermodular in $\mathbf{x}$ , and the proof is complete.
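As an illustration of the monotonicity and convexity claim, the sketch below assumes the queue is M/D/1 and uses the Pollaczek-Khinchine mean-value formula for an M/G/1 queue, $\mathbb{E}[n]=\rho+\rho^{2}(1+c_{s}^{2})/(2(1-\rho))$, where $c_{s}^{2}$ is the squared coefficient of variation of the service time ($c_{s}^{2}=0$ for deterministic service). Whether this matches the lost display is an assumption, and the function names are hypothetical.

```python
def mean_packets_mg1(rho, scv):
    """Pollaczek-Khinchine mean number in system for an M/G/1 queue.

    rho is the load in [0, 1); scv is the squared coefficient of variation
    of the service time (0 for M/D/1, 1 for M/M/1).
    """
    return rho + rho ** 2 * (1.0 + scv) / (2.0 * (1.0 - rho))


def is_increasing_and_convex(f, grid):
    """Check positive first differences and non-negative second differences."""
    values = [f(r) for r in grid]
    first = [b - a for a, b in zip(values, values[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return all(d > 0 for d in first) and all(d >= -1e-12 for d in second)


if __name__ == "__main__":
    grid = [0.05 * i for i in range(0, 19)]  # loads 0.00, ..., 0.90
    print(is_increasing_and_convex(lambda r: mean_packets_mg1(r, 0.0), grid))
    print(mean_packets_mg1(0.5, 0.0), mean_packets_mg1(0.5, 1.0))
```

Setting `scv=1.0` recovers the M/M/1 value $\rho/(1-\rho)$, which gives a quick sanity check of the formula.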

Appendix M Proof of Lemma 7

Consider the network of $M/M/1/k$ queues in Fig. 9, where node 1 is requesting content 1 from node 3, according to a Poisson process with rate $\lambda$ . For simplicity, we only consider the traffic for content 1. For queues $(2,1)$ and $(3,2)$ , it is easy to verify that the probability of packet drop at queues $(u,v)\in\{(2,1),(3,2)\}$ is given by

[TABLE]

where $\rho_{(u,v)}(\mathbf{x})$ is the traffic load on queue $(u,v)$ , and it can be computed for $(2,1)$ and $(3,2)$ as follows:

[TABLE]

Using the results reported in Table III, it is easy to verify that the $\rho$’s are not monotone in $\mathbf{x}$. Hence, no strictly monotone function of the $\rho$’s is monotone in $\mathbf{x}$. In addition, it can be verified that the $\rho$’s are neither submodular nor supermodular in $\mathbf{x}$. To show this, let sets $A=\emptyset$ and $B=\{(1,1)\}$ correspond to caching configurations $[0,0]$ and $[1,0]$, respectively. Note that $A\subset B$, and $(2,1)\notin B$. Since $\rho_{(3,2)}(A\cup\{(2,1)\})-\rho_{(3,2)}(A)=-\frac{\lambda}{\mu_{(3,2)}}\ngeqslant 0=\rho_{(3,2)}(B\cup\{(2,1)\})-\rho_{(3,2)}(B),$ $\rho_{(3,2)}$ is not submodular. Consequently, no strictly monotone function of $\rho_{(3,2)}$ is submodular. Similarly, as $\rho_{(2,1)}(A\cup\{(2,1)\})-\rho_{(2,1)}(A)=\frac{\lambda p_{(3,2)}^{L}}{\mu_{(2,1)}}\nleqslant 0=\rho_{(2,1)}(B\cup\{(2,1)\})-\rho_{(2,1)}(B),$ $\rho_{(2,1)}$ is not supermodular. Thus, no strictly monotone function of $\rho_{(2,1)}$ is supermodular.
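The diminishing-returns test used in this argument can be automated by brute force on small ground sets. In the sketch below, the checker names are hypothetical, and the load table for $\rho_{(3,2)}$ is a hypothetical stand-in chosen only to reproduce the two marginals quoted above with $\lambda/\mu_{(3,2)}=0.5$; it is not taken from Table III.

```python
from itertools import combinations


def subsets(ground):
    """Yield every subset of the ground set as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground, tol=1e-12):
    """Check f(A+e) - f(A) >= f(B+e) - f(B) for all A subset of B, e not in B."""
    for B in subsets(ground):
        for A in subsets(B):
            for e in ground - B:
                if f(A | {e}) - f(A) < f(B | {e}) - f(B) - tol:
                    return False
    return True


def is_supermodular(f, ground, tol=1e-12):
    """A function is supermodular exactly when its negation is submodular."""
    return is_submodular(lambda s: -f(s), ground, tol)


# Hypothetical load on queue (3,2); elements are (node, item) pairs.
GROUND = frozenset({(1, 1), (2, 1)})
RHO_32 = {
    frozenset(): 0.5,
    frozenset({(1, 1)}): 0.0,
    frozenset({(2, 1)}): 0.0,
    frozenset({(1, 1), (2, 1)}): 0.0,
}

if __name__ == "__main__":
    print(is_submodular(lambda s: RHO_32[s], GROUND))
```

With these stand-in values, the pair $A=\emptyset$, $B=\{(1,1)\}$ and the element $(2,1)$ produce the violating marginals $-0.5$ and $0$, so the check fails as in the text.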

Bibliography


[1] G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák, “Maximizing a monotone submodular function subject to a matroid constraint,” SIAM Journal on Computing, vol. 40, no. 6, pp. 1740–1766, 2011.
[2] F. P. Kelly, Reversibility and stochastic networks. Cambridge University Press, 2011.
[3] R. G. Gallager, Stochastic processes: theory for applications. Cambridge University Press, 2013.
[4] R. Nelson, Probability, Stochastic Processes, and Queueing Theory: The Mathematics of Computer Performance Modeling, 1st ed. Springer Publishing Company, Incorporated, 2010.
[5] H. Chen and D. D. Yao, Fundamentals of queueing networks: Performance, asymptotics, and optimization. Springer Science & Business Media, 2013, vol. 46.
[6] V. Jacobson, D. K. Smetters, J. D. Thornton, M. F. Plass, N. H. Briggs, and R. L. Braynard, “Networking named content,” in CoNEXT, 2009.
[7] E. Yeh, T. Ho, Y. Cui, M. Burd, R. Liu, and D. Leong, “VIP: A framework for joint dynamic forwarding and caching in named data networks,” in ICN, 2014.
[8] S. Ioannidis and E. Yeh, “Adaptive caching networks with optimality guarantees,” in SIGMETRICS, 2016.
