Kelly Cache Networks
Milad Mahdian, Armin Moharrer, Stratis Ioannidis, Edmund Yeh

TL;DR
This paper investigates cache placement in queueing networks to optimize objectives like delay and congestion, proposing efficient algorithms and extending results to various queue types.
Contribution
It formulates cache placement as a submodular maximization problem and introduces a fast continuous greedy algorithm with near-optimal approximation guarantees.
Findings
The optimization problem is NP-hard but approximable.
The continuous greedy algorithm achieves near 63% of optimal.
Results extend to M/M/k and M/D/1 queue networks.
Abstract
We study networks of M/M/1 queues in which nodes act as caches that store objects. Exogenous requests for objects are routed towards nodes that store them; as a result, object traffic in the network is determined not only by demand but, crucially, by where objects are cached. We determine how to place objects in caches to attain a certain design objective, such as, e.g., minimizing network congestion or retrieval delays. We show that for a broad class of objectives, including minimizing both the expected network delay and the sum of network queue lengths, this optimization problem can be cast as an NP- hard submodular maximization problem. We show that so-called continuous greedy algorithm attains a ratio arbitrarily close to using a deterministic estimation via a power series; this drastically reduces execution time over prior art, which resorts to sampling.…
| Kelly Cache Networks | |
| Network graph, with nodes and edges | |
| position of node in path | |
| Service rate of edge | |
| Set of classes/types of requests | |
| Arrival rate of class | |
| Path followed by class | |
| Object requested by class | |
| Item catalog | |
| Set of designated servers of | |
| Cache capacity at node | |
| Variable indicating whether stores | |
| Placement vector of s, in | |
| Vector of arrival rates , | |
| Arrival rate of class responses over edge | |
| Load on edge | |
| State space | |
| Global state vector in | |
| Steady-state distribution of | |
| State vector of queue at edge | |
| Marginal of steady-state distribution of queue | |
| Size of queue at edge | |
| Cache Optimization | |
| Global Cost function | |
| Cost function of edge | |
| Set of placements satisfying capacity constraints | |
| A feasible placement in | |
| Caching gain of placement over | |
| Probability that stores | |
| Vector of marginal probabilities , in | |
| Multilinear extension under marginals | |
| Set of placements under which system is stable under arrivals | |
| Convex hull of constraints of MaxCG | |
| Conventions | |
| Support of a vector | |
| Convex hull of a set | |
| Vector equal to with -th coordinate set to 1 | |
| Vector equal to with -th coordinate set to 0 | |
| Vector of zeros | |
| Graph | ||||||||
|---|---|---|---|---|---|---|---|---|
| ER | 100 | 1042 | 300 | 1K | 4 | 3 | 2.75 | 2.98 |
| ER-20Q | 100 | 1042 | 300 | 1K | 20 | 3 | 3.1 | 2.88 |
| HC | 128 | 896 | 300 | 1K | 4 | 3 | 2.25 | 5.23 |
| HC-20Q | 128 | 896 | 300 | 1K | 20 | 3 | 2.52 | 5.99 |
| star | 100 | 198 | 300 | 1K | 4 | 3 | 6.08 | 8.3 |
| path | 4 | 3 | 2 | 2 | 1 | 1 | 1.2 | 1.2 |
| dtelekom | 68 | 546 | 300 | 1K | 4 | 3 | 2.57 | 3.66 |
| abilene | 11 | 28 | 4 | 2 | 2 | 1/2 | 4.39 | 4.39 |
| geant | 22 | 66 | 10 | 100 | 4 | 2 | 19.68 | 17.22 |
| 0 | 0 | |
| 0 | ||
| 0 | 0 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Kelly Cache Networks
Milad Mahdian, Armin Moharrer, Stratis Ioannidis, and Edmund Yeh
Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
{mmahdian,amoharrer,ioannidis,eyeh}@ece.neu.edu
Abstract
We study networks of M/M/1 queues in which nodes act as caches that store objects. Exogenous requests for objects are routed towards nodes that store them; as a result, object traffic in the network is determined not only by demand but, crucially, by where objects are cached. We determine how to place objects in caches to attain a certain design objective, such as, e.g., minimizing network congestion or retrieval delays. We show that for a broad class of objectives, including minimizing both the expected network delay and the sum of network queue lengths, this optimization problem can be cast as an NP-hard submodular maximization problem. We show that so-called continuous greedy algorithm [1] attains a ratio arbitrarily close to using a deterministic estimation via a power series; this drastically reduces execution time over prior art, which resorts to sampling. Finally, we show that our results generalize, beyond M/M/1 queues, to networks of M/M/ and symmetric M/D/1 queues.
I Introduction
Kelly networks [2] are multi-class networks of queues capturing a broad array of queue service disciplines, including FIFO, LIFO, and processor sharing. Both Kelly networks and their generalizations (including networks of quasi-reversible and symmetric queues) are well studied and classic topics [2, 3, 4, 5]. One of their most appealing properties is that their steady-state distributions have a product-form: as a result, steady state properties such as expected queue sizes, packet delays, and server occupancy rates have closed-form formulas as functions of, e.g., routing and scheduling policies.
In this paper, we consider Kelly networks in which nodes are equipped with caches, i.e., storage devices of finite capacity, which can be used to store objects. Exogenous requests for objects are routed towards nodes that store them; upon reaching a node that stores the requested object, a response packet containing the object is routed towards the request source. As a result, object traffic in the network is determined not only by the demand but, crucially, by where objects are cached. This abstract setting is motivated by–and can be used to model–various networking applications involving the placement and transmission of content. This includes information centric networks [6, 7, 8], content delivery networks [9, 10], web-caches [11, 12, 13], wireless/femtocell networks [14, 15, 16], and peer-to-peer networks [17, 18], to name a few.
In many of these applications, determining the object placement, i.e., how to place objects in network caches, is a decision that can be made by the network designer in response to object popularity and demand. To that end, we are interested in determining how to place objects in caches so that traffic attains a design objective such as, e.g., minimizing delay.
We make the following contributions. First, we study the problem of optimizing the placement of objects in caches in Kelly cache networks of M/M/1 queues, with the objective of minimizing a cost function of the system state. We show that, for a broad class of cost functions, including packet delay, system size, and server occupancy rate, this optimization amounts to a submodular maximization problem with matroid constraints. This result applies to general Kelly networks with fixed service rates; in particular, it holds for FIFO, LIFO, and processor sharing disciplines at each queue.
The so-called continuous greedy algorithm [1] attains a approximation for this NP-hard problem. However, it does so by computing an expectation over a random variable with exponential support via randomized sampling. The number of samples required to attain the approximation guarantee can be prohibitively large in realistic settings. Our second contribution is to show that, for Kelly networks of M/M/1 queues, this randomization can be entirely avoided: a closed-form solution can be computed using the Taylor expansion of our problem’s objective. To the best of our knowledge, we are the first to identify a submodular maximization problem that exhibits this structure, and to exploit it to eschew sampling. Finally, we extend our results to networks of M/M/ and symmetric M/D/1 queues, and prove a negative result: submodularity does not arise in networks of M/M/1/ queues. We extensively evaluate our proposed algorithms over several synthetic and real-life topologies.
The remainder of our paper is organized as follows. We review related work in Sec. II. We present our mathematical model of a Kelly cache network in Sec. III, and our results on submodularity and the continuous-greedy algorithm in networks of M/M/1 queues in Sections IV and V, respectively. Our extensions are described in Sec. VI; our numerical evaluation is in Sec. VII. Finally, we conclude in Sec. VIII.
II Related Work
Our approach is closest to, and inspired by, recent work by Shanmugam et al. [19] and Ioannidis and Yeh [8]. Ioannidis and Yeh consider a setting very similar to ours but without queuing: edges are assigned a fixed weight, and the objective is a linear function of incoming traffic scaled by these weights. This can be seen as a special case of our model, namely, one where edge costs are linear (see also Sec. III-B). Shanmugam et al. [19] study a similar optimization problem, restricted to the context of femtocaching. The authors show that this is an NP-hard, submodular maximization problem with matroid constraints. They provide a approximation algorithm based on a technique by Ageev and Sviridenko [20]: this involves maximizing a concave relaxation of the original objective, and rounding via pipage-rounding[20]. Ioannidis and Yeh show that the same approximation technique applies to more general cache networks with linear edge costs. They also provide a distributed, adaptive algorithm that attains an approximation. The same authors extend this framework to jointly optimize both caching and routing decisions [21].
Our work can be seen as an extension of [8, 19], in that it incorporates queuing in the cache network. In contrast to both [8] and [19] however, costs like delay or queue sizes are highly non-linear in the presence of queuing. From a technical standpoint, this departure from linearity requires us to employ significantly different optimization methods than the ones in [8, 19]. In particular, our objective does not admit a concave relaxation and, consequently, the technique by Ageev and Sviridenko [20] used in [8, 19] does not apply. Instead, we must solve a non-convex optimization problem directly (c.f. Eq. (13)) using the so-called continuous-greedy algorithm.
Several papers have studied the cache optimization problems under restricted topologies [22, 23, 24, 25, 9]. These works model the network as a bipartite graph: nodes generating requests connect directly to caches in a single hop. The resulting algorithms do not readily generalize to arbitrary topologies. In general, the approximation technique of Ageev and Sviridenko [20] applies to this bipartite setting, and additional approximation algorithms have been devised for several variants [22, 23, 24, 9]. We differ by (a) considering a multi-hop setting, and (b) introducing queuing, which none of the above works considers.
Submodular function maximization subject to matroid constraints appears in many important problems in combinatorial optimization; for a brief review of the topic and applications, see [26] and [27], respectively. Nemhauser et al. [28] show that the greedy algorithm produces a solution within 1/2 of the optimal. Vondrák [29] and Calinescu et al. [1] show that the continuous-greedy algorithm produces a solution within of the optimal in polynomial time, which cannot be further improved [30]. In the general case, the continuous-greedy algorithm requires sampling to estimate the gradient of the so-called multilinear relaxation of the objective (see Sec. V). One of our main contributions is to show that MaxCG, the optimization problem we study here, exhibits additional structure: we use this to construct a sampling-free estimator of the gradient via a power-series or Taylor expansion. To the best of our knowledge, we are the first to use such an expansion to eschew sampling; this technique may apply to submodular maximization problems beyond MaxCG.
III Model
Motivated by applications such as ICNs [6], CDNs [9, 10], and peer-to-peer networks [17], we introduce Kelly cache networks. In contrast to classic Kelly networks, each node is associated with a cache of finite storage capacity. Exogenous traffic consisting of requests is routed towards nodes that store objects; upon reaching a node that stores the requested object, a response packet containing the object is routed towards the node that generated the request. As a result, content traffic in the network is determined not only by demand but, crucially, by how contents are cached. For completeness, we review classic Kelly networks in Appendix A. An illustration highlighting the differences between Kelly cache networks, introduced below, and classic Kelly networks, can be found in Fig. 1.
Although we describe Kelly cache networks in terms of FIFO M/M/1 queues, the product form distribution (c.f. (4)) arises for many different service principles beyond FIFO (c.f. Section 3.1 of [2]) including Last-In First-Out (LIFO) and processor sharing. All results we present extend to these service disciplines; we discuss more extensions in Sec. VI.
III-A Kelly Cache Networks
Graphs and Paths. We use the notation for a directed graph with nodes and edges . A directed graph is called symmetric or bidirectional if if and only if . A path is a sequence of adjacent nodes, i.e., where , for all . A path is simple if it contains no loops (i.e., each node appears once). We use the notation , where , to indicate that node appears in the path, and , where , to indicate that nodes , are two consecutive (and, therefore, adjacent) nodes in . For , where is simple, we denote by the position of node in , i.e., if .
Network Definition. Formally, we consider a Kelly network of M/M/1 FIFO queues, represented by a symmetric directed graph . As in classic Kelly networks, each edge is associated with an M/M/1 queue with service rate 111We associate queues with edges for concreteness. Alternatively, queues can be associated with nodes, or both nodes and edges; all such representations lead to product form distributions (4), and all our results extend to these cases.. In addition, each node has a cache that stores objects of equal size from a set , the object catalog. Each node may store at most objects from in its cache. Hence, if is a binary variable indicating whether node is storing object , then for all We refer to as the global placement or, simply, placement vector. We denote by
[TABLE]
the set of feasible placements that satisfy the storage capacity constraints. We assume that for every object , there exists a set of nodes that permanently store . We refer to nodes in as designated servers for . We assume that designated servers store in permanent storage outside their cache. Put differently, the aggregate storage capacity of a node is , but only the non-designated slots are part of the system’s design.
Object Requests and Responses. Traffic in the cache network consists of two types of packets: requests and responses, as shown in Fig. 1(b). Requests for an object are always routed towards one of its designated servers, ensuring that every request is satisfied. However, requests may terminate early: upon reaching any node that caches the requested object, the latter generates a response carrying the object. This is forwarded towards the request’s source, following the same path as the request, in reverse. Consistent with prior literature [8, 21], we treat request traffic as negligible when compared to response traffic, which carries objects, and henceforth focus only on queues bearing response traffic.
Formally, a request and its corresponding response are fully characterized by (a) the object being requested, and (b) the path that the request follows. That is, for the set of requests , a request is determined by a pair , where is the object being requested and is the path the request follows. Each request is associated with a corresponding Poisson arrival process with rate , independent of other arrivals and service times. We denote the vector of arrival rates by For all , we assume that the path is well-routed [8], that is: (a) path is simple, (b) the terminal node of the path is a designated server, i.e., a node in , and (c) no other intermediate node in is a designated server. As a result, requests are always served, and response packets (carrying objects) always follow a sub-path of in reverse towards the request source (namely, ).
Steady State Distribution. Given an object placement , the resulting system is a multi-class Kelly network, with packet classes determined by the request set . This is a Markov process over the state space determined by queue contents. In particular, let be the number of packets of class in queue , and be the total queue size. The state of a queue , , is the vector of length representing the class of each packet in each position of the queue. The system state is then given by ; we denote by the state space of this Markov process.
In contrast to classic Kelly networks, network traffic and, in particular, the load on each queue, depend on placement . Indeed, if for , the arrival rate of responses of class in queue is:
[TABLE]
i.e., responses to requests of class pass through edge if and only if no node preceding in the path stores object –see also Fig. 1(b). As is the service rate of the queue in , the load on edge is:
[TABLE]
The Markov process is positive recurrent when , for all [2, 31]. Then, the steady-state distribution has a product form, i.e.:
[TABLE]
where and , are given by (2), (3), respectively.
Stability Region. Given a placement , a vector of arrival rates yields a stable (i.e., positive recurrent) system if and only if , where
[TABLE]
where loads , , are given by (3). Conversely, given a vector ,
[TABLE]
is the set of feasible placements under which the system is stable. It is easy to confirm that, by the monotonicity of w.r.t. , if and then , where the vector inequality is component-wise. In particular, if (i.e., the system is stable without caching), then .
III-B Cache Optimization
Given a Kelly cache network represented by graph , service rates , , storage capacities , , a set of requests , and arrival rates , for , we wish to determine placements that optimize a certain design objective. In particular, we seek placements that are solutions to optimization problems of the following form:
[TABLE]
where , , are positive cost functions, is the load on edge , given by (3), and is the set of feasible placements that ensure stability, given by (6). We make the following standing assumption on the cost functions appearing in MinCost:
Assumption 1**.**
For all , functions are convex and non-decreasing on .
Assumption 1 is natural; indeed it holds for many cost functions that often arise in practice. We list several examples:
Example 1. Queue Size: Under steady-state distribution (4), the expected number of packets in queue is given by which is indeed convex and non-decreasing for . Hence, the expected total number of packets in the system in steady state can indeed be written as the sum of such functions.
Example 2. Delay: From Little’s Theorem [31], the expected delay experienced by a packet in the system is where is the total arrival rate, and is the expected size of each queue. Thus, the expected delay can also be written as the sum of functions that satisfy Assumption 1. We note that the same is true for the sum of the expected delays per queue , as the latter are given by which are also convex and non-decreasing in .
Example 3. Queuing Probability/Load per Edge: In a FIFO queue, the queuing probability is the probability of arriving in a system where the server is busy; this is given by , which is again non-decreasing and convex. This is also, of course, the load per edge. By treating as the weight of edge , this setting recovers the objective of [8] as a special case of our model.
Example 4. Monotone Separable Costs: More generally, consider a state-dependent cost function that satisfies the following three properties: (1) it is separable across queues, (2) it depends only on queue sizes , and (3) it is non-decreasing w.r.t. these queue sizes. Formally, where , , are non-decreasing functions of the queue sizes. Then, the steady state cost under distribution (4) has precisely form (7a) with convex costs, i.e., where satisfy Assumption 1. This follows from the fact that:
[TABLE]
The proof is in Appendix B.
In summary, MinCost captures many natural cost objectives, while Assumption 1 holds for any monotonically increasing cost function that depends only on queue sizes.
IV Submodularity and the Greedy Algorithm
Problem MinCost is NP-hard; this is true even when cost functions are linear, and the objective is to minimize the sum of the loads per edge [8, 19]. In what follows, we outline our methodology for solving this problem; it relies on the fact that the objective of MinCost is a supermodular set function; our first main contribution is to show that this property is a direct consequence of Assumption 1.
Cost Supermodularity and Caching Gain. First, observe that the cost function in MinCost can be naturally expressed as a set function. Indeed, for , let be the binary vector whose support is (i.e., its non-zero elements are indexed by ). As there is a 1-1 correspondence between a binary vector and its support , we can interpret as set function via Then, the following theorem holds:
Theorem 1**.**
Under Assumption 1, is non-increasing and supermodular over .
A detailed proof of Theorem 1 can be found in Appendix C. In light of the observations in Sec. III-B regarding Assumption 1, Thm. 1 implies that supermodularity arises for a broad array of natural cost objectives, including expected delay and system size; it also applies under the full generality of Kelly networks, including FIFO, LIFO, and round robin service disciplines. Armed with this theorem, we turn our attention to converting MinCost to a submodular maximization problem. In doing so, we face the problem that domain , determined not only by storage capacity constraints, but also by stability, may be difficult to characterize. Nevertheless, we show that a problem that is amenable to approximation can be constructed, provided that a placement is known.
In particular, suppose that we have access to a single . We define the caching gain as Note that, for , is the relative decrease in the cost compared to the cost under . We consider the following optimization problem:
[TABLE]
Observe that, if , then ; in this case, taking ensures that problems MinCost and MaxCG are equivalent. If , the above formulation attempts to maximize the gain restricted to placements that dominate : such placements necessarily satisfy . Thm. 1 has the following immediate implication:
Corollary 1**.**
The caching gain is non-decreasing and submodular over .
Greedy Algorithm. Constraints (9b) define a (partition) matroid [1, 19]. This, along with the submodularity and monotonicity of imply that we can produce a solution within -approximation from the optimal via the greedy algorithm [32]. The algorithm, summarized in Alg. 1, iteratively allocates items to caches that yield the largest marginal gain. The solution produced by Algorithm 1 is guaranteed to be within a -approximation ratio of the optimal solution of MaxCG [28]. The approximation guarantee of is tight:
Lemma 1**.**
For any , there exists a cache network the greedy algorithm solution is within from the optimal, when the objective is the sum of expected delays per edge.
The proof of Lemma 1 can be found in Appendix D. The instance under which the bound is tight is given in Fig. 2. As we discuss in Sec. VII, the greedy algorithm performs well in practice for some topologies; however, Lemma 1 motivates us to seek alternative algorithms, that attain improved approximation guarantees.
V Continuous-Greedy Algorithm
The continuous-greedy algorithm by Calinescu et al. [1] attains a tighter guarantee than the greedy algorithm, raising the approximation ratio from to . The algorithm maximizes the so-called multilinear extension of objective , thereby obtaining a fractional solution in the convex hull of the constraint space. The resulting solution is then rounded to produce an integral solution.
V-A Algorithm Overview
Formally, the multilinear extension of the caching gain is defined as follows. Define the convex hull of the set defined by the constraints (9b) in MaxCG as:
[TABLE]
Intuitively, is a fractional vector in satisfying the capacity constraints, and the bound .
Given a , consider a random vector in generated as follows: for all and , the coordinates are independent Bernoulli variables such that . The multilinear extension of is defined via following expectation , parameterized by , i.e.,
[TABLE]
The continuous-greedy algorithm, summarized in Alg. 2, proceeds by first producing a fractional vector . Starting from , the algorithm iterates over:
[TABLE]
for an appropriately selected step size . Intuitively, this yields an approximate solution to the non-convex problem:
[TABLE]
Even though (13) is not convex, the output of Alg. 2 is within a factor from the optimal solution of (13). This fractional solution can be rounded to produce a solution to MaxCG with the same approximation guarantee using either the pipage rounding [20] or the swap rounding [1, 33] schemes: we review both in Appendix E.
A Sampling-Based Estimator. Function , given by (11), involves a summation over terms, and cannot be easily computed in polynomial time. Typically, a sampling-based estimator (see, e.g., [1]) is used instead. Function is linear when restricted to each coordinate , for some , (i.e., when all inputs except are fixed). As a result, the partial derivative of w.r.t. can be written as:
[TABLE]
where the last inequality is due to monotonicity of . One can thus estimate the gradient by (a) producing random samples , of the random vector , consisting of independent Bernoulli coordinates, and (b) computing, for each pair , the average
[TABLE]
where , are equal to vector with the -th coordinate set to 1 and 0, respectively. Using this estimate, Alg. 2 attains an approximation ratio arbitrarily close to for appropriately chosen .
In particular, the following theorem holds:
Theorem 2**.**
[Calinescu et al. [1]] Consider Alg. 2, with replaced by the sampling-based estimate , given by (15). Set , and , where Then, the algorithm terminates after steps and, with high probability,
[TABLE]
where is an optimal solution to (13).
The proof of the theorem can be found in Appendix A of Calinescu et al. [1] for general submodular functions over arbitrary matroid constraints; we state Thm. 2 here with constants and set specifically for our objective and our set of constraints .
Under this parametrization of and , Alg. 2 runs in polynomial time. More specifically, note that is polynomial in the input size. Moreover, the algorithm runs for iterations in total. Each iteration requires samples, each involving a polynomial computation (as can be evaluated in polynomial time). Finally, LP (12a) can be solved in polynomial time in the number of constraints and variables, which are .
V-B A Novel Estimator via Taylor Expansion
The classic approach to estimate the gradient via sampling has certain drawbacks. The number of samples required to attain the ratio is quadratic in . In practice, even for networks and catalogs of moderate size (say, ), the number of samples becomes prohibitive (of the order of ). Producing an estimate for via a closed form computation that eschews sampling thus has significant computational advantages. In this section, we show that the multilinear relaxation of the caching gain admits such a closed-form characterization.
We say that a polynomial is in Weighted Disjunctive Normal Form (W-DNF) if it can be written as
[TABLE]
for some index set , positive coefficients , and index sets . Intuitively, treating binary variables as boolean values, each W-DNF polynomial can be seen as a weighted sum (disjunction) among products (conjunctions) of negative literals. These polynomials arise naturally in the context of our problem; in particular:
Lemma 2**.**
For all , , and , is a W-DNF polynomial whose coefficients depend on .
Proof (Sketch).
The lemma holds for by (2) and (3). The lemma follows by induction, as W-DNF polynomials over binary are closed under multiplication; this is because for all when . ∎
Hence, all load powers are W-DNF polynomials; a detailed proof be found in Appendix F. Expectations of W-DNF polynomials have a remarkable property:
Lemma 3**.**
Let be a W-DNF polynomial, and let be a random vector of independent Bernoulli coordinates parameterized by . Then , where is the evaluation of the W-DNF polynomial representing over the real vector .
Proof.
As is W-DNF, it can be written as
[TABLE]
for appropriate , and appropriate , where . Hence,
[TABLE]
Lemma 3 states that, to compute the expectation of a W-DNF polynomial over i.i.d. Bernoulli variables with expectations , it suffices to evaluate over input . Expectations computed this way therefore do not require sampling.
We leverage this property to approximate by taking the Taylor expansion of the cost functions at each edge . This allows us to write as a power series w.r.t. , ; from Lemmas 2 and 3, we can compute the expectation of this series in a closed form. In particular, by expanding the series and rearranging terms it is easy to show the following lemma, which is proved in Appendix G:
Lemma 4**.**
Consider a cost function which satisfies Assumption 1 and for which the Taylor expansion exists at some . Then, for a random Bernoulli vector parameterized by ,
[TABLE]
where, for and the error of the approximation is: \textstyle\frac{1}{(L+1)!}\sum_{e\in E}C^{(L+1)}_{e}(\rho^{\prime})\Big{[}\mathbb{E}_{[\mathbf{y}]_{-{v,i}}}[(\rho_{e}(\mathbf{x},\bm{\lambda})-\rho^{*})^{L+1}]\textstyle-\mathbb{E}_{[\mathbf{y}]_{+{v,i}}}[(\rho_{e}(\mathbf{x},\bm{\lambda})-\rho^{*})^{L+1}]\Big{]}.
Estimator (17) is deterministic: no random sampling is required. Moreover, Taylor’s theorem allows us to characterize the error (i.e., the bias) of this estimate. We use this to characterize the final fractional solution produced by Alg. 2:
Theorem 3**.**
Assume that all , , satisfy Assumption 1, are -differentiable, and that all their derivatives are bounded by . Then, consider Alg. 2, in which is estimated via the Taylor estimator (17), where each edge cost function is approximated at Then,
[TABLE]
where is the number of iterations, is an optimal solution to (13), is the diameter of , is the bias of estimator (17), and is a Lipschitz constant of .
The proof can be found in Appendix H. The theorem immediately implies that we can replace (17) as an estimator in Alg. 2, and attain an approximation arbitrarily close to .
Estimation via Power Series. For arbitrary -differentiable cost functions , the estimator (17) can be leveraged by replacing with its Taylor expansion. In the case of queue-dependent cost functions, as described in Example 4 of Section III-B, the power-series (8) can be used instead. For example, the expected queue size (Example 1, Sec. III-B), is given by In contrast to the Taylor expansion, this power series does not depend on a point around which the function is approximated.
VI Beyond M/M/1 queues
As discussed in Section III, the classes of M/M/1 queues for which the supermodularity of the cost functions arises is quite broad, and includes FIFO, LIFO, and processor sharing queues. In this section, we discuss how our results extend to even broader families of queuing networks. Chapter 3 of Kelly [2] provides a general framework for a set of queues for which service times are exponentially distributed; for completeness, we also summarize this in Appendix I. A large class of networks can be modeled by this framework, including networks of M/M/ queues; all such networks maintain the property that steady-state distributions have a product form. This allows us to extend our results to M/M/ queues for two cost functions :
Lemma 5**.**
For a network of M/M/k queues, both the queuing probability222This is given by the so-called Erlang C formula [31]. and the expected queue size are non-increasing and supermodular over sets .
We note that, as an immediate consequence of Lemma 5 and Little’s theorem, both the sum of the expected delays per queue, but also the expected delay of an arriving packet, are also supermodular and non-decreasing.
Product-form steady-state distributions arise also in settings where service times are not exponentially distributed. A large class of quasi-reversible queues, named symmetric queues exhibit this property (c.f. Section 3.3 of [2] and Chapter 10 of [4]). For completeness, we again summarize symmetric queues in Appendix K. In the following lemma we leverage the product form of symmetric queues to extend our results to M/D/1 symmetric queues [31].
Lemma 6**.**
For a network of M/D/1 symmetric queues, the expected queue size is non-increasing and supermodular over sets .
Again, Lemma 6 and Little’s theorem imply that this property also extends to network delays. It is worth noting that conclusions similar to these in Lemmas 5 and 6 are not possible for all general queues with product form distributions. In particular, also we prove the following negative result:
Lemma 7**.**
There exists a network of M/M/1/k queues, containing a queue , for which no strictly monotone function of the load at a queue is non-increasing and supermodular over sets . In particular, the expected size of queue is neither monotone nor supermodular.
VII Numerical Evaluation
Networks. We execute Algorithms 1 and 2 over 9 network topologies, summarized in Table II. Graphs ER and ER-20Q are the same 100-node Erdős-Rényi graph with parameter . Graphs HC and HC-20Q are the same hypercube graph with 128 nodes, and graph star is a star graph with 100 nodes. The graph path is the topology shown in Fig. 2. The last 3 topologies, namely, dtelekom, geant, and abilene represent the Deutsche Telekom, GEANT, and Abilene backbone networks, respectively. The latter is also shown in Fig. 3.
Experimental Setup. For path and abilene, we set demands, storage capacities, and service rates as illustrated in Figures 2 and 3, respectively. Both of these settings induce an approximation ratio close to for greedy. For all remaining topologies, we consider a catalog of size objects; for each object, we select 1 node uniformly at random (u.a.r.) from to serve as the designated server for this object. To induce traffic overlaps, we also select nodes u.a.r. that serve as sources for requests; all requests originate from these sources. All caches are set to the same storage capacity, i.e., for all . We generate a set of possible types of requests. For each request type , request per second, and path is generated by selecting a source among the sources u.a.r., and routing towards the designated server of object using a shortest path algorithm. We consider two ways of selecting objects : in the uniform regime, is selected u.a.r. from the catalog ; in the power-law regime, is selected from the catalog via a power law distribution with exponent . All the parameter values, e.g., catalog size , number of requests , number of query sources , and caching capacities are presented in Table II.
We construct heterogeneous service rates as follows. Every queue service rate is either set to a low value or a high value for all We select and as follows. Given the demands and the corresponding arrival rates , we compute the highest load under no caching (), i.e., we find We then set and . We set the service rate to for all congested edges, i.e., edges s.t. . We set the service rate for each remaining edge to independently with probability 0.7, and to otherwise. Note that, as a result , i.e., the system is stable even in the absence of caching and, on average, 30 percent of the edges have a high service rate.
Placement Algorithms. We implement several placement algorithms: (a) Greedy, i.e., the greedy algorithm (Alg. 1), (b) Continuous-Greedy with Random Sampling (CG-RS), i.e., Algorithm 2 with a gradient estimator based on sampling, as described in Sec. V-A, (c) Continuous-Greedy with Taylor approximation (CGT), i.e., Algorithm 2 with a gradient estimator based on the Taylor expansion, as described in Sec. V-B, and (d) Continuous-Greedy with Power Series approximation (CG-PS), i.e., Algorithm 2 with a gradient estimator based on the power series expansion, described also in Sec. V-B. In the case of CG-RS, we collect 500 samples, i.e., . In the case of CG-PS we tried the first and second order expansions of the power series as CG-PS1 and CG-PS2, respectively. In the case of CGT, we tried the first-order expansion . In both cases, subsequent to the execution of Alg. 2 we produce an integral solution in by rounding via the swap rounding method [33]. All continuous-greedy algorithms use We also implement a random selection algorithm (RND), which caches items at each node , selected uniformly at random. We repeat RND 10 times, and report the average running time and caching gain.
Caching Gain Across Different Topologies. The caching gain for generated by different placement algorithms, is shown for power-law arrival distribution and uniform arrival distribution in Figures 4a and 4b, respectively. The values are normalized by the gains obtained by RND, reported in Table II. Also, the running times of the algorithms for power-law arrival distribution are reported in Fig. 5. As we see in Fig. 4, Greedy is comparable to other algorithms in most topologies. However, for topologies path and abilene Greedy obtains a sub-optimal solution, in comparison to the continuous-greedy algorithm. In fact, for path and abilene Greedy performs even worse than RND. In Fig. 4, we see that the continuous-greedy algorithms with gradient estimators based on Taylor and Power series expansion, i.e., CG-PS1, CG-PS2, and CGT outperform CG-RS500 in most topologies. Also, from Fig. 5, we see that CG-RS500 runs 100 times slower than the continuous-greedy algorithms with first-order gradient estimators, i.e., CG-PS1 and CGT. Note that 500 samples are significantly below the value, stated in Theorem 2, needed to attain the theoretical guarantees of the continuous-greedy algorithm, which is quadratic in .
Varying Service Rates. For topologies path and abilene, the approximation ratio of Greedy is This ratio is a function of service rate of the high-bandwidth link In this experiment, we explore the effect of varying on the performance of the algorithms in more detail. We plot the caching gain obtained by different algorithms for path and abilene topologies, using different values of where is the value that puts the system on the brink of instability, i.e., 1 and for path and abilene, respectively. Thus, we gradually increase the discrepancy between the service rate of low-bandwidth and high-bandwidth links. The corresponding caching gains are plotted in Fig. 6, as a function of . We see that as increases the gain attained by Greedy worsens in both topologies: when Greedy matches the performance of the continuous-greedy algorithms, in both cases. However, for higher values of it is beaten not only by all variations of the continuous-greedy algorithm, but by RND as well.
Effect of Congestion on Caching Gain. In this experiment, we study the effect of varying arrival rates on caching gain . We report results only for the dtelekom and ER topologies and power-law arrival distribution. We obtain the cache placements using the parameters presented in Table II and different arrival rates: for . Fig. 7 shows the caching gain attained by the placement algorithms as a function of arrival rates. We observe that as we increase the arrival rates, the caching gain attained by almost all algorithms, except RND, increases significantly. Moreover, CG-PS1, CG-PS2, CGT, and Greedy have a similar performance, while CG-RS500 achieves lower caching gains.
Varying Caching Capacity. In this experiment, we study the effect of increasing cache capacity on the acquired caching gains. Again, we report the results only for the dtelekom and ER topologies and power-law arrival distribution. We evaluate the caching gain obtained by different placement algorithms using the parameters of Table II and different caching capacities: for The caching gain is plotted in Fig. 8. As we see, in all cases the obtained gain increases, as we increase the caching capacities. This is expected: caching more items reduces traffic and delay, increasing the gain.
VIII Conclusions
Our analysis suggests feasible object placements targeting many design objectives of interest, including system size and delay, can be determined using combinatorial techniques. Our work leaves the exact characterization of approximable objectives for certain classes of queues, including M/M/1/k queues, open. Our work also leaves open problems relating to stability. This includes the characterization of the stability region of arrival rates . It is not clear whether determining membership in this set (or, equivalently, given , determining whether there exists a under which the system is stable) is NP-hard or not, and whether this region can be somehow approximated. Finally, all algorithms presented in this paper are offline: identifying how to determine placements in an online, distributed fashion, in a manner that attains a design objective (as in [8, 21]), or even stabilizes the system (as in [7]), remains an important open problem.
IX Acknowledgements
The authors gratefully acknowledge support from National Science Foundation grant NeTS-1718355, as well as from research grants by Intel Corp. and Cisco Systems.
Appendix A Kelly Networks
Kelly networks [2, 4, 3] (i.e., multi-class Jackson networks) are networks of queues operating under a fairly general service disciplines (including FIFO, LIFO, and processor sharing, to name a few). As illustrated in Fig. 1(a), a Kelly network can be represented by a directed graph , in which each edge is associated with a queue. In the case of First-In First-Out (FIFO) queues, each edge/link is associated with an M/M/1 queue with service rate . In an open network, packets of exogenous traffic arrive, are routed through consecutive queues, and subsequently exit the network; the path followed by a packet is determined by its class.
Formally, let be the set of packet classes. For each packet class , we denote by the simple path of adjacent nodes visited by a packet. Packets of class arrive according to an exogenous Poisson arrival process with rate , independent of other arrival processes and service times. Upon arrival, a packet travels across nodes in , traversing intermediate queues, and exits upon reaching the terminal node in .
A Kelly network forms a Markov process over the state space determined by queue contents. In particular, let be the number of packets of class in queue , and be the total queue size. The state of a queue , , is the vector of length representing the class of each packet in each position of the queue. The system state is then given by ; we denote by the state space of this Markov process.
The aggregate arrival rate at an edge is given by , while the load at edge is given by . Kelly’s extension of Jackson’s theorem [2] states that, if for all , the Markov process is positive recurrent, and its steady-state distribution has the following product form:
[TABLE]
where
[TABLE]
As a consequence, the queue sizes , , also have a product form distribution in steady state, and their marginals are given by:
[TABLE]
The steady-state distribution (19) holds for many different service principles beyond FIFO (c.f. Section 3.1 of [2] and Appendix I). In short, incoming packets can be placed in random position within the queue according to a given distribution, and the (exponentially distributed) service effort can be split across different positions, possibly unequally; both placement and service effort distributions are class-independent. This captures a broad array of policies including FIFO, Last-In First-Out (LIFO), and processor sharing: in all cases, the steady-state distribution is given by (19).
Appendix B Monotone Separable Costs
Consider the state-dependent cost functions introduced in Section III-B. The cost at state can be written as Hence On the other hand, as , we have that
[TABLE]
As is non-decreasing, for all . On the other hand, for all , is a convex non-decreasing function of in , so is a convex function of as a positively weighted sum of convex non-decreasing functions. ∎
Appendix C Proof of Theorem 1
We first prove the following auxiliary lemma:
Lemma 8**.**
Let be a convex and non-decreasing function. Also, let be a non-increasing supermodular set function. Then is also supermodular.
Proof.
Since is non-increasing, for any we have
[TABLE]
[TABLE]
Due to supermodularity of , we can find , such that
[TABLE]
[TABLE]
Then, we have
[TABLE]
where the first inequality is due to convexity of , and the second one is because and is non-increasing. This proves is supermodular. ∎
To conclude the proof of Thm. 1, observe that it is easy to verify that , is supermodular and non-increasing in . Since, by Assumption 1, is a non-decreasing function, then, is non-increasing. By Lemma 8, is also supermodular. Hence, the cost function is non-increasing and supermodular as the sum of non-increasing and supermodular functions.
Appendix D Proof of Lemma 1
Consider the path topology illustrated in Fig. 2. Assume that requests for files 1 and 2 are generated at node with rates , for some . Files 1 and 2 are stored permanently at and , respectively. Caches exist only on and , and have capacity . Edges , have bandwidth , while edge is a high bandwidth link, having capacity . Let . The greedy algorithm starts from empty caches and adds item 2 at cache . This is because the caching gain from this placement is , while the caching gain of all other decisions is at most . Any subsequent caching decisions do not change the caching gain. The optimal solution is to cache item 1 at and item 2 at , yielding a caching gain of . Hence, the greedy solution attains an approximation ratio By appropriately choosing and , this can be made arbitrarily close to 0.5. ∎
Appendix E Rounding
Several poly-time algorithms can be used to round the fractional solution that is produced by Alg. 2 to an integral . We briefly review two such rounding algorithms: pipage rounding [20], which is deterministic, and swap-rounding [33], which is randomized. For a more rigorous treatment, we refer the reader to [20, 8] for pipage rounding, and [33] for swap rounding.
Pipage rounding uses the following property of : given a fractional solution , there are at least two fractional variables and , such that transferring mass from one to the other, makes at least one of them 0 or 1, the new remains feasible in , and , that is, the expected caching gain at is at least as good as . This process is repeated until does not have any fractional element, at which point pipage rounding terminates and return . This procedure has a run-time of [8], and since each rounding step can only increase , it follows that the final integral must satisfy
[TABLE]
where is an optimal solution to MaxCG. Here, the first equality holds because and are equal when their arguments are integral, while the last inequality holds because (13) is a relaxation of MaxCG, maximizing the same objective over a larger domain.
In swap rounding, given a fractional solution produced by Alg. 2 observe that it can be written as a convex combination of integral vectors in , i.e., where and .Moreover, by construction, each such vector is maximal, in that all capacity constraints are tight. Swap rounding iteratively merges these constituent integral vectors, producing an integral solution. At each iteration , the present integral vector is merged with into a new integral solution as follows: if the two solutions , differ at a cache , items in this cache are swapped to reduce the set difference: either an item in a cache in replaces an item in , or an item in replaces an item in ; the former occurs with probability proportional to , and the latter with probability proportional to . The swapping is repeated until the two integer solutions become identical; this merged solution becomes . This process terminates after steps, after which all the points are merged into a single integral vector Observe that, in contrast to pipage rounding, swap rounding does not require any evaluation of the objective during rounding. This makes swap rounding significantly faster to implement; this comes at the expense of the approximation ratio, however, as the resulting guarantee is in expectation.
Appendix F Proof of Lemma 2
We prove this by induction on . Observe first that, by (3), the load on each edge can be written as a polynomial of the following form:
[TABLE]
for appropriately defined
[TABLE]
In other words, is indeed a W-DNF polynomial. For the induction step, observe that W-DNF polynomials, seen as functions over the integral domain , are closed under multiplication. In particular, the following lemma holds:
Lemma 9**.**
Given two W-DNF polynomials and , given by
[TABLE]
their product is also a W-DNF polynomial over , given by:
[TABLE]
Proof.
To see this, observe that
[TABLE]
where is the symmetric set difference. On the other hand, as , we have that , and the lemma follows. ∎
Hence, if is a W-DNF polynomial, by (23) and Lemma 9, so is .∎
Appendix G Proof of Lemma 4
The Taylor expansion of at is given by:
[TABLE]
where and is the -th order derivative of . By expanding this polynomial and reorganizing the terms, we get
[TABLE]
where
[TABLE]
for Consider now the -th order Taylor approximation of , given by
[TABLE]
Clearly, this is an estimator of , with an error of the order Thus, for a random Bernoulli vector parameterized by ,
[TABLE]
On the other hand, for all and :
[TABLE]
where the error of the approximation is given by
[TABLE]
The lemma thus follows from Lemmas 2 and 3.
Appendix H Proof of Theorem 3
We begin by bounding the bias of estimator (LABEL:eq:taylorapprox). Indeed, given a set of continuous functions where their first derivatives within their operating regime, , are upperbounded by a finite constant, , the bias of estimator , where is defined by (17), is given by
[TABLE]
where . To compute the bias, we note that . Specifically, we assume . Hence, , and . In particular, let . Then, it is easy to compute the following upper bound on the bias of :
[TABLE]
In addition, note that is linear in , and hence [1]:
[TABLE]
which is due to monotonicity of . It is easy to verify that . For , we can compute the second derivative of [1] as given by
[TABLE]
which is due to the supermodularity of . Hence, is component-wise concave [1] .
In additions, it is easy to see that for , , , and are bounded by , and , respectively. Consequently, and are -Lipschitz continuous, with .
In the th iteration of the Continuous Greedy algorithm, let , where . Since and is closed-down, . Due to monotonicity of , it follows
[TABLE]
We introduce univariate auxiliary function . Since is component-wise concave, then, is concave in . In addition, since is concave for , it follows
[TABLE]
Now let be the vector chosen by Algorithm 2 in the th iteration. We have
[TABLE]
For the LHS, we have
[TABLE]
where , is the upperbound on the diameter of , is as defined in (27), and (i) follows from Cauchy-Schwarz inequality. Similarly, we have for the RHS of that (31)
[TABLE]
It follows
[TABLE]
where follows from (30), and follows from (29).
Using the -Lipschitz continuity property of (due to -Lipschitz continuity of ), it is straightforward to see that
[TABLE]
hence,
[TABLE]
where follows from (34), respectively. By rearranging the terms and letting , we have
[TABLE]
where is true since , and holds due to the greedy nature of Algorithm 2 and monotonicity of . In addition, Algorithm 2 ensures . It follows
[TABLE]
This result holds for general stepsizes . The RHS of (37) is indeed maximized when , which is the assumed case in Algorithm 2. In addition, we have , and hence, . Therefore, we have
[TABLE]
Appendix I General Kelly Networks
In Kelly’s network of queues (see Section 3.1 of [2] for more information), queue , assuming it contains packets in the queue, operates in the following manner:
Each packet (customer) requires an exponentially distributed amount of service. 2. 2.
A total service effort is provided by queue at the rate . 3. 3.
The packet in position in the queue is provided with a portion of the total service effort, for ; when this packet completes service and leaves the queue, packets in positions move down to positions , respectively. 4. 4.
An arriving packet at queue moves into position , for , with probability ; packets that where in positions , move up to positions , respectively.
Clearly, we require for ; in addition,
[TABLE]
[TABLE]
Kelly’s theorem [2] states that, if for all , the state of queue in equilibrium is independent of the rest of the system, hence, it will have a product form. In addition, the probability that queue contains packets is
[TABLE]
where is the normalizing factor. As can be seen from (41), note that the steady-state distribution is not function of ’s, and ’s, and hence, is independent of the packet placement and service allocation distributions.
We note that by allowing , we obtain the results in (21).
Appendix J Proof of Lemma 5
For an arbitrary network of M/M/k queues, the traffic load on queue is given as
[TABLE]
which is similar to that of M/M/1 queues, but normalized by the number of servers, . Hence, is submodular in . For an M/M/k queue, the probability that an arriving packet finds all servers busy and will be forced to wait in queue is given by Erlang C formula [31], which follows
[TABLE]
where
[TABLE]
is the normalizing factor. In addition, the expected number of packets waiting for or under transmission is given by
[TABLE]
Lee and Cohen in [34], shows that and are strictly increasing and convex in , for . In addition, a more direct proof of convexity of was shown by Grassmann in [35]. Hence, Both and are increasing and convex. Due to Theorem 1, we note that both functions are non-increasing and supermodular in , and the proof is complete.
Appendix K Networks of Symmetric Queues
Let be the number of packets placed in positions in queue . Queue is defined as symmetric queue if it operates in the following manner
The service requirement of a packet is a random variable whose distribution may depend upon the class of the customer. 2. 2.
A total service effort is provided by queue at the rate . 3. 3.
The packet in position in the queue is provided with a portion of the total service effort, for ; when this packet completes service and leaves the queue, packets in positions move down to positions , respectively. 4. 4.
An arriving packet at queue moves into position , for , with probability ; packets that where in positions , move up to positions , respectively.
Similarly, we require for ; in addition,
[TABLE]
As shown in [2], and [4], symmetric queues have product form steady-state distributions. In particular, it turns out the probability of there are packets in queue is similar to that given by (41).
Appendix L Proof of Lemma 6
Let be the traffic load on queue , as defined by (3). It can be shown that the average number of packets in queue is of form [31]
[TABLE]
It is easy to see that this function is strictly increasing and convex in for . Due to Theorem 1, is non-increasing and supermodular in , and the proof is complete.
Appendix M Proof of Lemma 7
Consider the network of queues in Fig. 9, where node 1 is requesting content 1 from node 3, according to a Poisson process with rate . For simplicity, we only consider the traffic for content 1. For queues and , it is easy to verify that the probability of packet drop at queues is given by
[TABLE]
where is the traffic load on queue , and it can be computed for and as follows:
[TABLE]
[TABLE]
Using the results reported in Table III, it is easy to verify that ’s are not monotone in . Hence, no strictly monotone function of ’s are monotone in . In addition, it can be verified that ’s are neither submodular, nor supermodular in . To show this, let sets , and , correspond to caching configurations and , respectively. Note that , and . Since then is not submodular. Consequently, no strictly monotone function of is submodular. Similarly, as is not supermodular. Thus, no strictly monotone function of is supermodular.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák, “Maximizing a monotone submodular function subject to a matroid constraint,” SIAM Journal on Computing , vol. 40, no. 6, pp. 1740–1766, 2011.
- 2[2] F. P. Kelly, Reversibility and stochastic networks . Cambridge University Press, 2011.
- 3[3] R. G. Gallager, Stochastic processes: theory for applications . Cambridge University Press, 2013.
- 4[4] R. Nelson, Probability, Stochastic Processes, and Queueing Theory: The Mathematics of Computer Performance Modeling , 1st ed. Springer Publishing Company, Incorporated, 2010.
- 5[5] H. Chen and D. D. Yao, Fundamentals of queueing networks: Performance, asymptotics, and optimization . Springer Science & Business Media, 2013, vol. 46.
- 6[6] V. Jacobson, D. K. Smetters, J. D. Thornton, M. F. Plass, N. H. Briggs, and R. L. Braynard, “Networking named content,” in Co NEXT , 2009.
- 7[7] E. Yeh, T. Ho, Y. Cui, M. Burd, R. Liu, and D. Leong, “VIP: A framework for joint dynamic forwarding and caching in named data networks,” in ICN , 2014.
- 8[8] S. Ioannidis and E. Yeh, “Adaptive caching networks with optimality guarantees,” in SIGMETRICS , 2016.
