Video Streaming in Distributed Erasure-coded Storage Systems: Stall Duration Analysis
Abubakr O. Al-Abbasi, Vaneet Aggarwal

TL;DR
This paper analyzes stall durations in distributed erasure-coded cloud video streaming, providing bounds on stall metrics and proposing an optimization approach to enhance user experience.
Contribution
It is the first work to model and analyze video streaming over erasure-coded distributed cloud systems, introducing bounds on stall durations and an optimization method for QoE improvement.
Findings
Significant reduction in mean stall duration.
Lower tail probability of stall durations.
Improved QoE metrics compared to baselines.
| Symbol | Meaning |
|---|---|
| $r$ | Number of video files in system |
| $m$ | Number of storage nodes |
| $L_i$ | Number of segments for video file $i$ |
| $G_{i,j}$ | Segment $j$ of video file $i$ |
| $(n_i, k_i)$ | Erasure code parameters for file $i$ |
| $C_{i,j}^{(q)}$ | $q$-th coded chunk of segment $j$ in file $i$ |
| $\lambda_i$ | Poisson arrival rate of file $i$ |
| $\pi_{ij}$ | Probability of retrieving chunk of file $i$ from node $j$ using probabilistic scheduling algorithm |
| $\mathcal{S}_i$ | Set of storage nodes having coded chunks of file $i$ |
| $\mathcal{A}_i$ | Set of storage nodes used to access chunks from file $i$ |
| $(\alpha_j, \beta_j)$ | Parameters of shifted exponential distribution |
| | Service time distribution of a chunk at node $j$ |
| $M_j(t)$ | Moment generating function for the service time of a chunk at node $j$ |
| $\sigma$ | Parameter indexing stall duration tail probability |
| $D_{i,j}^{(q)}$ | Download time for coded chunk $q$ of file $i$ from storage node $j$ |
| $R_j$ | Service time of the video files |
| $\overline{R}_j(s)$ | Laplace-Stieltjes Transform of $R_j$ |
| $B_j(t)$ | Moment generating function for the service time of video files |
| | Mean service time of a chunk from storage node $j$ |
| $\Lambda_j$ | Aggregate arrival rate at node $j$ |
| $\rho_j$ | Video file request intensity at node $j$ |
| $T_i^{(q)}$ | The time at which the segment $G_{i,q}$ is played back |
| $d_s$ | Start-up delay |
| $\tau$ | Chunk size in seconds |
| $\Pr(\Gamma^{(i)} \ge \sigma)$ | Stall duration tail probability for file $i$ |
| $\theta$ | Trade-off factor between mean stall duration and stall duration tail probability |
The authors are with the School of Industrial Engineering, Purdue University, West Lafayette, IN 47907 (email: {aalabbas,vaneet}@purdue.edu). This work was supported in part by the National Science Foundation under Grant No. CNS-1618335.
Abstract
The demand for global video has been burgeoning across industries. With the expansion and improvement of video-streaming services, cloud-based video is evolving into a necessary feature of any successful business for reaching internal and external audiences. This paper considers video streaming over distributed systems where the video segments are encoded using an erasure code for better reliability; to the best of our knowledge, it is the first work to consider video streaming over erasure-coded distributed cloud systems. The download time of each coded chunk of each video segment is characterized, and ordered statistics over the choice of the erasure-coded chunks are used to obtain the playback time of different video segments. Using the playback times, bounds on the moment generating function of the stall duration are used to bound the mean stall duration. Moment generating function based bounds on the ordered statistics are also used to bound the stall duration tail probability, which is the probability that the stall time is greater than a pre-defined number. These two metrics, mean stall duration and stall duration tail probability, are important quality of experience (QoE) measures for the end users. Based on these metrics, we formulate an optimization problem that minimizes a convex combination of the two QoE metrics, averaged over all requests, over the placement and access of the video content. The non-convex problem is solved using an efficient iterative algorithm. Numerical results show significant improvement in QoE metrics for cloud-based video as compared to the considered baselines.
Index Terms:
Distributed Storage, Erasure Codes, Stall Duration, Tail Latency, Repetition Coding, Video Streaming
I Introduction
The demand for video streaming services has been skyrocketing in recent years, with the global video streaming market expected to grow annually at a rate of 18.3% [1]. With the proliferation and advancement of video-streaming services, cloud-based video has become an imperative feature of any successful business. This can also be seen in IBM's estimate that cloud-based video will be a $105 billion market opportunity by 2019 [2]. In cloud storage systems, erasure coding has quickly emerged as a promising technique to reduce the storage cost for a given reliability as compared to replicated systems [3, 4]. It has been widely adopted in modern storage systems by companies like Facebook [5], Microsoft [6], and Google [7]. This paper considers video streaming when the content is placed on cloud servers where erasure coding is used. The key quality of experience (QoE) metric for video streaming is the duration of stalls at the clients. This paper gives bounds on the stall durations and uses them to propose an optimized streaming service that minimizes the average stall-based QoE metrics for the clients.
In this paper, we consider two QoE metrics defined in terms of the stall duration. The first is the mean stall duration. Almost every viewer associates the quality of a video-watching experience with the stall duration, which is thus one of the key focuses of the streaming algorithms studied in [8, 9]. The second is the probability that the stall duration is greater than a fixed number $\sigma$, which determines the stall duration tail probability. It has been shown that in modern Web applications such as Bing, Facebook, and Amazon's retail platform, the long tail of latency is of particular concern, with 99.9th percentile response times that are orders of magnitude worse than the mean [10, 11]. Thus, the QoE metric of stall duration tail probability becomes important. This paper characterizes an upper bound on both QoE metrics.
We note that quantifying service latency for erasure-coded storage is an open problem [12], and so is quantifying tail latency [13]. This paper takes a step forward and explores these notions for video streaming rather than video download. Thus, finding the exact QoE metrics is an open problem, and this paper finds bounds on the QoE metrics instead. The data chunk transfer time in practical systems follows a shifted exponential distribution [14, 15], which motivates modeling the service time at each video server as a shifted exponential distribution. Further, the request arrivals for each video are assumed to be Poisson. The video segments are encoded using an $(n, k)$ erasure code and the coded segments are placed on $n$ different servers. When a video is requested, the segments need to be requested from $k$ out of the $n$ servers. An optimal strategy for choosing these $k$ servers would need a Markov approach similar to that in [12] and suffers from a similar state explosion problem, because the states of the corresponding queuing model must encapsulate not only a snapshot of the current system, including chunk placement and queued requests, but also the past history of how chunk requests have been processed by individual nodes.
In this paper, we use the probabilistic scheduling proposed in [16, 14] to access the servers, where each possible choice of $k$ servers is made with a certain probability and the probability terms can be optimized. Using this scheduling mechanism, the random variables corresponding to the download times of different video segments from each server are found. Using ordered statistics over the $k$ servers, the random variables corresponding to the playback time of each video segment are characterized. These are then used to find bounds on the mean stall duration and the stall duration tail probability. Moment generating functions of the ordered statistics of different random variables are used in the bounds. We note that the problem of finding latency for file download is very different from that of finding the video stall duration for streaming. This is because the stall duration accounts for the download time of each video segment rather than only the download time of the last video segment. Further, the download times of the segments are correlated, since the chunks are downloaded from a server in sequence, and the playback time of a video segment depends on the playback time of the previous segment and the download time of the current segment. Taking these dependencies into account, this paper characterizes the bounds on the two QoE metrics. We note that for the special case when each video has a single segment, the bounds on mean stall duration and stall duration tail probability reduce to those for file download. Further, the bounds based on the approach in this paper have been shown to outperform the results for mean file download latency in [16, 14].
The proposed framework provides a mathematical crystallization of the engineering artifacts involved and illuminates key system design issues through optimization of QoE. The average QoE metric over different requests can be optimized over the placement of the video files, the access of the video files from the servers, and the bound parameters. The tradeoff between the two QoE metrics is captured by an objective function that is a convex combination of the two metrics. Varying the parameter that trades off the two metrics yields a tradeoff region between them, which helps the system designer choose an appropriate operating point. An efficient algorithm is proposed to solve the resulting non-convex problem. The proposed algorithm performs an alternating optimization over the placement, access, and bound parameters. The optimization over the probabilistic scheduling access parameters helps reduce the mean and tail of the stall durations by differentiating video files, thus providing more flexibility as compared to choosing the lowest-queue servers.
The sub-problems have been shown to have convex constraints and thus can be efficiently solved using the iNner cOnVex Approximation (NOVA) algorithm proposed in [17]. The proposed algorithm is shown to converge to a local optimum. Numerical results demonstrate significant improvement of QoE metrics as compared to the baselines.
Today, cloud-based video does not use erasure coding. One of the key reasons is the additional latency of decoding from multiple coded streams. Since computing power has been growing exponentially [18], it is only a matter of time before the computation of decoding no longer limits the latency in delay-sensitive video streaming and the networking latency governs the system design. Further, we note that replication is a special case of erasure coding. Thus, the proposed research using erasure-coded content on the servers can also be used when the content is replicated on the servers.
The key contributions of our paper include:
- This paper formulates video streaming over an erasure-coded cloud storage system.
- The random variable corresponding to the download time of a chunk of each video segment from a server is characterized. Using ordered statistics, the random variable corresponding to the playback time of each video segment is found. These are further used to derive upper bounds on the mean stall duration of the video and the video stall duration tail probability.
- The QoE metrics are used to formulate system optimization problems over the choice of the placement of video segments, the probabilistic scheduling access policy, and the bound parameters, which are related to the moment generating function. Efficient iterative solutions are provided for these optimization problems.
- Numerical results show that the proposed algorithm converges within a few iterations. Further, the QoE metrics are shown to improve significantly as compared to the considered baselines. For instance, the mean stall duration for the proposed algorithm is 60% smaller and the stall duration tail probability is orders of magnitude better as compared to the random placement and projected equal access probability strategy.
The remainder of this paper is organized as follows. Section II provides related work for this paper. In Section III, we describe the system model used in the paper with a description of video streaming over cloud storage. Section IV derives expressions for the download and play times of the chunks, which are used in Sections V and VI to find the upper bounds on the QoE metrics of the mean stall duration and the stall duration tail probability, respectively. Section VII formulates the QoE optimization problem as a weighted combination of the two QoE metrics and proposes an iterative algorithmic solution to this problem. Numerical results are provided in Section VIII. Section IX concludes the paper.
II Related Work
*Latency in Erasure-coded Storage: * To the best of our knowledge, while latency in erasure-coded storage systems has been widely studied, quantifying the exact latency for an erasure-coded storage system in a data-center network remains an open problem. Prior works focusing on asymptotic queuing delay behaviors [19, 20] are not applicable because the redundancy factor in practical data centers typically remains small due to storage cost concerns. Due to the lack of analytic latency models, most of the literature is focused on reliable distributed storage system design, and latency is only presented as a performance metric when evaluating the proposed erasure coding scheme, e.g., [21, 22], which demonstrate latency improvement due to erasure coding in different system implementations. Related designs can also be found in data access scheduling [23, 24], access collision avoidance [25], and encoding/decoding time optimization [26]. There is also some work using LT erasure codes to adjust the system to meet user requirements such as availability, integrity, and confidentiality [27].
Recently, there have been a number of attempts at finding latency bounds for an erasure-coded storage system [16, 14, 15, 28, 12]. The key scheduling approaches include the block-one-scheduling policy, which only allows the request at the head of the buffer to move forward [29]; the fork-join queue [30, 28], which requests data from all servers and waits for the first to finish; and probabilistic scheduling [16, 14], which allows the choice of every possible subset of nodes with a certain probability. Mean latency and tail latency have been characterized in [16, 14] and [13], respectively, for a system with multiple files using probabilistic scheduling. This paper considers video streaming rather than file downloading. The metrics for video streaming account not only for the end of the download of the video but also for the download of each of its segments. Thus, the analysis for content download cannot be extended to video streaming directly, and the analysis approach in this paper is very different from the prior works in the area.
*Video Streaming over Cloud: * Servicing Video on Demand and Live TV content from cloud servers has been studied widely [31, 32, 33, 34, 35]. The placement of content and resource optimization over the cloud servers have been considered. To the best of our knowledge, reliability of content over the cloud servers has not been considered for video streaming applications. In the presence of erasure coding, there are novel challenges in characterizing and optimizing the QoE metrics at the end user. Adaptive streaming algorithms have also been considered for video streaming [36, 37]; these are beyond the scope of this paper and are left for future work.
III System Model
We consider a distributed storage system consisting of $m$ heterogeneous servers (also called storage nodes), denoted by $\mathcal{M} = \{1, 2, \ldots, m\}$. Each video file $i$, where $i = 1, 2, \ldots, r$, is divided into $L_i$ equal segments, $G_{i,1}, \ldots, G_{i,L_i}$, each of length $\tau$ seconds. Then, each segment $G_{i,j}$ for $j \in \{1, 2, \ldots, L_i\}$ is partitioned into $k_i$ fixed-size chunks and then encoded using an $(n_i, k_i)$ Maximum Distance Separable (MDS) erasure code to generate $n_i$ distinct chunks for each segment $G_{i,j}$. These coded chunks are denoted as $C_{i,j}^{(1)}, \ldots, C_{i,j}^{(n_i)}$. The encoding setup is illustrated in Figure 1.
The encoded chunks are stored on the disks of $n_i$ distinct storage nodes. These storage nodes are represented by a set $\mathcal{S}_i$, such that $\mathcal{S}_i \subseteq \mathcal{M}$ and $|\mathcal{S}_i| = n_i$. Each server $z \in \mathcal{S}_i$ stores all the chunks $C_{i,j}^{(g_z)}$ for all $j$ and for some $g_z \in \{1, \ldots, n_i\}$. In other words, each of the $n_i$ storage nodes stores one of the coded chunks for the entire duration of the video. The placement on the servers is illustrated in Figure 2, where one server is shown to store the first coded chunks of one file, the third coded chunks of a second file, and the first coded chunks of a third file.
The use of an $(n_i, k_i)$ MDS erasure code introduces a redundancy factor of $n_i/k_i$, which allows the video to be reconstructed from the video chunks from any subset of $k_i$-out-of-$n_i$ servers. We note that the erasure code can also help in recovery of the content as long as $k_i$ of the servers containing file $i$ are available [4]. Note that replication along $n$ servers is equivalent to choosing an $(n, 1)$ erasure code. Hence, when a video $i$ is requested, the request goes to a set $\mathcal{A}_i$ of the storage nodes, where $\mathcal{A}_i \subseteq \mathcal{S}_i$ and $|\mathcal{A}_i| = k_i$. From each server $z \in \mathcal{A}_i$, all chunks $C_{i,j}^{(g_z)}$ for all $j$ and the value of $g_z$ corresponding to that placed on server $z$ are requested. The request is illustrated in Figure 2. In order to play a segment $q$ of video $i$, $C_{i,q}^{(g_z)}$ should have been downloaded from all $z \in \mathcal{A}_i$. We assume that an edge router, which aggregates multiple users, is requesting the files. Thus, the connections between the servers and the edge router are considered as the bottleneck. Since the service provider only has control over this part of the network and the last hop may not be under the control of the provider, the service provider can only guarantee the quality of service up to the edge router. The key notations used are defined in Table II in Appendix A.
We assume that the files at each server are served in order of the request in a first-in-first-out (FIFO) policy. Further, the different chunks of a file are processed in order of the duration, that is, in playback order. This is depicted in Figure 3, where for a server $j$, when a file $i$ is requested, all of its chunks are placed in the queue behind the earlier video requests that have not yet been served.
In order to schedule the requests for video file $i$ to the $n_i$ servers, the choice of $k_i$-out-of-$n_i$ servers is important. Finding the optimal choice of these servers to compute the latency expressions is an open problem to the best of our knowledge. Thus, this paper uses a policy called Probabilistic Scheduling, which was proposed in [16, 14]. This policy allows the choice of every possible subset of $k_i$ nodes with a certain probability. Upon the arrival of a request for video file $i$, we randomly dispatch the batch of $k_i$ chunk requests to an appropriate set of nodes (denoted by the set $\mathcal{A}_i$ of servers for file $i$) with predetermined probabilities ($P(\mathcal{A}_i)$ for set $\mathcal{A}_i$ and file $i$). Then, each node buffers requests in a local queue and processes them in order and independently, as explained before. The authors of [16, 14] proved that a probabilistic scheduling policy with feasible probabilities $\{P(\mathcal{A}_i): \forall i, \mathcal{A}_i\}$ exists if and only if there exist conditional probabilities $\pi_{ij} \in [0, 1]$, $\forall i, j$, satisfying
$$\sum_{j=1}^{m} \pi_{ij} = k_i \quad \forall i, \qquad \text{and} \qquad \pi_{ij} = 0 \ \text{ if } j \notin \mathcal{S}_i.$$
In other words, selecting each node $j$ with probability $\pi_{ij}$ would yield a feasible choice of $\{P(\mathcal{A}_i): \forall i, \mathcal{A}_i\}$. Thus, we consider the request probabilities $\pi_{ij}$ as the probability that the request for video file $i$ uses server $j$. While probabilistic scheduling has been used to give bounds on the latency of file download, this paper uses the scheduling to give bounds on the QoE for video streaming.
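As a rough illustration of the feasibility condition above, the following Python sketch checks a vector of per-node probabilities for one file and then draws an access set whose inclusion probabilities match that vector. The sampling routine uses systematic sampling, a standard survey-sampling technique swapped in here for illustration; it is not claimed to be the construction used in [16, 14]. The names `is_feasible`, `sample_access_set`, `pi`, `k`, and `stored_on` are hypothetical.

```python
import math
import random

def is_feasible(pi, k, stored_on, tol=1e-9):
    """Check one file's marginals: each in [0, 1], zero off the storage set, summing to k."""
    in_range = all(-tol <= p <= 1 + tol for p in pi)
    on_support = all(p <= tol for j, p in enumerate(pi) if j not in stored_on)
    return in_range and on_support and abs(sum(pi) - k) <= tol

def sample_access_set(pi, rng=random):
    """Systematic sampling (an assumed, illustrative choice): node j is selected with
    probability pi[j], and exactly sum(pi) nodes are selected when the sum is an integer."""
    u = rng.random()  # single uniform offset in [0, 1)
    chosen, cum = [], 0.0
    for j, p in enumerate(pi):
        lo, cum = cum, cum + p
        # Select node j if one of the points u, u + 1, u + 2, ... lies in [lo, cum).
        if math.ceil(cum - u) > math.ceil(lo - u):
            chosen.append(j)
    return chosen
```

For example, with the hypothetical marginals `pi = [1.0, 0.25, 0.75, 0.0]` and `k = 2`, node 0 is always selected and exactly one of nodes 1 and 2 joins it. Any other sampler with the same marginals and fixed set size would serve equally well for the analysis, which depends on the scheduling only through $\pi_{ij}$.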
We note that it may not be ideal in practice for a server to finish one video request before starting another, since that increases the delay for future requests. However, this can be easily alleviated by considering that each server has multiple queues (streams) to the edge router, which can all be considered as separate servers. These multiple streams allow multiple parallel videos from the server. The probabilistic scheduling can choose $k_i$ of the overall queues to access the content. Possible extensions to accommodate such scenarios are shown in Appendix K.
We now describe a queuing model of the distributed storage system. We assume that the arrivals of client requests for each video $i$ form an independent Poisson process with a known rate $\lambda_i$. The arrival of file requests at node $j$ forms a Poisson process with rate $\Lambda_j = \sum_{i} \lambda_i \pi_{ij}$, which is the superposition of $r$ Poisson processes, each with rate $\lambda_i \pi_{ij}$.
We assume that the chunk service time for each coded chunk at server $j$, $Y_j$, follows a shifted exponential distribution, as has been demonstrated in realistic systems [14, 15]. The service time distribution for the chunk service time at server $j$, $Y_j$, is given by the probability density function $f_j(x)$, which is
$$f_j(x) = \begin{cases} \alpha_j e^{-\alpha_j (x - \beta_j)}, & x \ge \beta_j, \\ 0, & x < \beta_j. \end{cases} \tag{1}$$
We note that the exponential distribution is a special case with $\beta_j = 0$. We also note that constant delays, such as the networking delay and the decoding time, can be easily factored into the shift of the shifted exponential distribution. Let $M_j(t) = \mathbb{E}[e^{t Y_j}]$ be the moment generating function of $Y_j$. Then, $M_j(t)$ is given as
$$M_j(t) = \frac{\alpha_j}{\alpha_j - t}\, e^{\beta_j t}, \qquad \forall\, t < \alpha_j. \tag{2}$$
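The shifted exponential service model can be checked numerically. The sketch below, with hypothetical function names and illustrative parameter values, samples chunk service times as a constant shift plus an exponential part and evaluates the closed-form moment generating function of a shifted exponential so that it can be compared against a Monte Carlo estimate.

```python
import math
import random

def sample_service_time(alpha, beta, rng=random):
    """Shifted exponential: constant shift beta plus an Exp(rate=alpha) component."""
    return beta + rng.expovariate(alpha)

def mgf_shifted_exp(t, alpha, beta):
    """Closed-form MGF alpha / (alpha - t) * exp(beta * t), finite only for t < alpha."""
    if t >= alpha:
        raise ValueError("the MGF is finite only for t < alpha")
    return alpha / (alpha - t) * math.exp(beta * t)

if __name__ == "__main__":
    rng = random.Random(1)
    alpha, beta, t = 2.0, 0.5, 0.5  # illustrative values, not taken from the paper
    xs = [sample_service_time(alpha, beta, rng) for _ in range(50000)]
    estimate = sum(math.exp(t * x) for x in xs) / len(xs)
    print(estimate, mgf_shifted_exp(t, alpha, beta))
```

The restriction $t < \alpha_j$ matters later, because every bound built from this moment generating function inherits it as a constraint on the auxiliary parameter.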
We note that the arrival rates are given in terms of the video files, and the service rate above is provided in terms of the coded chunks at each server. The client plays a video segment after all the $k_i$ chunks for the segment have been downloaded and the previous segment has been played. We also assume that there is a start-up delay of $d_s$ (in seconds) for the video, which is the duration in which the content can be buffered but not played. This paper will characterize the mean stall duration and the stall duration tail probability for this setting.
IV Download and Play Times of the Chunks
In order to understand the stall duration, we need to see the download time of different coded chunks and the play time of the different segments of the video.
IV-A Download Times of the Chunks from each Server
In this subsection, we will quantify the download time of chunk $q \in \{1, \ldots, L_i\}$ for video file $i$ from server $j$, which stores the chunks $C_{i,q}^{(g_j)}$ for all $q = 1, \ldots, L_i$. As seen in Figure 3, the download of chunk $q$ consists of two components: the waiting time of all the video files in the queue before the file $i$ request, and the service time of all chunks of video file $i$ up to the $q$-th chunk. Let $W_j$ be the random variable corresponding to the waiting time of all the video files in the queue before the file $i$ request, and let $Y_j^{(v)}$ be the (random) service time of coded chunk $v$ for file $i$ from server $j$. Then, the (random) download time for coded chunk $q \in \{1, \ldots, L_i\}$ for file $i$ at server $j \in \mathcal{A}_i$, $D_{i,j}^{(q)}$, is given as
$$D_{i,j}^{(q)} = W_j + \sum_{v=1}^{q} Y_j^{(v)}. \tag{3}$$
We will now find the distribution of $W_j$. We note that this is the waiting time for the video files, whose arrival rate is given as $\Lambda_j = \sum_i \lambda_i \pi_{ij}$. Since the arrival process of video files is Poisson, the waiting time for the start of video download from a server $j$, $W_j$, is given by that of an M/G/1 queue. In order to find the waiting time, we need the service time statistics of the video files. Note that $f_j(x)$ gives the service time distribution of only a chunk and not of the video files.
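Video file $i$ consists of $L_i$ coded chunks at server $j$ ($j \in \mathcal{S}_i$). The total service time for video file $i$ at server $j$, if requested from server $j$, $ST_{i,j}$, is given as
$$ST_{i,j} = \sum_{v=1}^{L_i} Y_j^{(v)}. \tag{4}$$
The service time of the video files at server $j$ is given as
$$R_j = ST_{i,j} \quad \text{with probability } \frac{\pi_{ij}\lambda_i}{\Lambda_j}, \qquad \forall i, \tag{5}$$
since the service time is $ST_{i,j}$ when file $i$ is requested from server $j$. Let $\overline{R}_j(s) = \mathbb{E}[e^{-s R_j}]$ be the Laplace-Stieltjes Transform of $R_j$.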
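Lemma 1.
The Laplace-Stieltjes Transform of $R_j$, $\overline{R}_j(s) = \mathbb{E}[e^{-s R_j}]$, is given as
$$\overline{R}_j(s) = \sum_{i=1}^{r} \frac{\pi_{ij}\lambda_i}{\Lambda_j} \left( \frac{\alpha_j e^{-\beta_j s}}{\alpha_j + s} \right)^{L_i}. \tag{6}$$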
Proof.
The proof is provided in Appendix B. ∎
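Corollary 1.
The moment generating function for the service time of video files when requested from server $j$, $B_j(t)$, is given by
$$B_j(t) = \sum_{i=1}^{r} \frac{\pi_{ij}\lambda_i}{\Lambda_j} \left( \frac{\alpha_j e^{\beta_j t}}{\alpha_j - t} \right)^{L_i} \tag{7}$$
for any $t > 0$ and $t < \alpha_j$.
Proof.
This corollary follows from (6) by setting $t = -s$. ∎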
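The server utilization for the video files at server $j$ is given as $\rho_j = \Lambda_j \mathbb{E}[R_j]$. Since $\mathbb{E}[R_j] = -\overline{R}_j'(0)$, using Lemma 1, we have
$$\rho_j = \sum_{i=1}^{r} \pi_{ij} \lambda_i L_i \left( \beta_j + \frac{1}{\alpha_j} \right). \tag{8}$$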
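A minimal sketch of how the per-node load follows from the scheduling probabilities is given below. It assumes the superposition rule for Poisson arrivals and the mean $\beta_j + 1/\alpha_j$ of a shifted exponential chunk time; the function name `node_load` and the example values are hypothetical.

```python
def node_load(lam, L, pi, alpha, beta):
    """Return (Lambda, rho) per node.

    lam[i]   : Poisson arrival rate of file i
    L[i]     : number of segments (chunks per server) of file i
    pi[i][j] : probability that a request for file i uses node j
    alpha[j], beta[j] : rate and shift of the chunk service time at node j
    """
    r, m = len(lam), len(alpha)
    # Aggregate arrival rate: superposition of r thinned Poisson streams.
    Lambda = [sum(lam[i] * pi[i][j] for i in range(r)) for j in range(m)]
    # Utilization: arrival rate of each file times its mean total service time.
    rho = [
        sum(lam[i] * pi[i][j] * L[i] for i in range(r)) * (beta[j] + 1.0 / alpha[j])
        for j in range(m)
    ]
    return Lambda, rho

if __name__ == "__main__":
    Lambda, rho = node_load(
        lam=[0.01, 0.02], L=[10, 20],
        pi=[[1, 1, 0], [0.5, 0.5, 1]],
        alpha=[2, 2, 4], beta=[0.5, 0.5, 0.25],
    )
    print(Lambda, rho)
    print(all(x < 1 for x in rho))  # stability requires utilization below 1 at every node
```

A scheduling matrix that pushes any entry of `rho` to 1 or above makes the corresponding queue unstable, so a check of this kind is a natural guard before evaluating any of the bounds.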
Having characterized the service time distribution of the video files via the Laplace-Stieltjes Transform $\overline{R}_j(s)$, the Laplace-Stieltjes Transform of the waiting time $W_j$ can be characterized using the Pollaczek-Khinchine formula for M/G/1 queues [38], since the request pattern is Poisson and the service time has a general distribution. Thus, the Laplace-Stieltjes Transform of the waiting time $W_j$ is given as
$$\mathbb{E}[e^{-s W_j}] = \frac{(1 - \rho_j)\, s}{s - \Lambda_j \left( 1 - \overline{R}_j(s) \right)}. \tag{9}$$
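The transform above is hard to inspect directly, so a sanity check on its first moment can help. The sketch below uses the standard mean-value consequence of the Pollaczek-Khinchine result, $\mathbb{E}[W] = \Lambda\,\mathbb{E}[R^2] / (2(1-\rho))$, and compares it with a simulation of a FIFO single-server queue via the Lindley recursion. The function names, the two-file mixture, and all parameter values are assumptions made for illustration.

```python
import random

def video_service_moments(weights, L, alpha, beta):
    """First two moments of the file service time at one node: a mixture over files,
    where file i needs L[i] i.i.d. shifted exponential chunk times."""
    mean_chunk = beta + 1.0 / alpha
    mean = sum(w * l * mean_chunk for w, l in zip(weights, L))
    second = sum(w * (l / alpha**2 + (l * mean_chunk) ** 2) for w, l in zip(weights, L))
    return mean, second

def pk_mean_wait(Lambda, mean_R, second_moment_R):
    """Mean M/G/1 waiting time from the Pollaczek-Khinchine mean-value formula."""
    rho = Lambda * mean_R
    if rho >= 1:
        raise ValueError("unstable queue: utilization must be below 1")
    return Lambda * second_moment_R / (2.0 * (1.0 - rho))

def simulate_mean_wait(Lambda, weights, L, alpha, beta, n, rng):
    """Lindley recursion: next wait = max(0, wait + service - interarrival)."""
    wait, total = 0.0, 0.0
    for _ in range(n):
        total += wait
        l = rng.choices(L, weights=weights)[0]
        # Sum of l shifted exponentials: l shifts plus a Gamma(shape=l, scale=1/alpha).
        service = l * beta + rng.gammavariate(l, 1.0 / alpha)
        wait = max(0.0, wait + service - rng.expovariate(Lambda))
    return total / n

if __name__ == "__main__":
    weights, L, alpha, beta, Lambda = [0.5, 0.5], [10, 20], 2.0, 0.5, 0.02
    mean_R, second_R = video_service_moments(weights, L, alpha, beta)
    print(pk_mean_wait(Lambda, mean_R, second_R))
    print(simulate_mean_wait(Lambda, weights, L, alpha, beta, 100000, random.Random(3)))
```

The simulation starts from an empty queue, so its estimate is slightly biased downward for short runs; the bias vanishes as the number of simulated requests grows.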
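Having characterized the Laplace-Stieltjes Transform of the waiting time $W_j$ and knowing the distribution of $Y_j^{(v)}$, the Laplace-Stieltjes Transform of the download time $D_{i,j}^{(q)}$ is given as
$$\mathbb{E}[e^{-s D_{i,j}^{(q)}}] = \frac{(1 - \rho_j)\, s}{s - \Lambda_j \left( 1 - \overline{R}_j(s) \right)} \left( \frac{\alpha_j e^{-\beta_j s}}{\alpha_j + s} \right)^{q}. \tag{10}$$
We note that the expression above holds only in the range of $s$ where $s - \Lambda_j (1 - \overline{R}_j(s)) > 0$ and $s > -\alpha_j$. Further, the server utilization $\rho_j$ must be less than $1$. The overall download time of all the chunks for the segment $G_{i,q}$ at the client, $D_i^{(q)}$, is given by
$$D_i^{(q)} = \max_{j \in \mathcal{A}_i} D_{i,j}^{(q)}. \tag{11}$$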
IV-B Play Time of Each Video Segment
Let $T_i^{(q)}$ be the time at which the segment $G_{i,q}$ is played (started) at the client. The startup delay of the video is $d_s$. Then, the first segment can be played at the maximum of the time at which the first segment is downloaded and the startup delay. Thus,
$$T_i^{(1)} = \max\left( d_s,\; D_i^{(1)} \right). \tag{12}$$
For $1 < q \le L_i$, the play time of segment $q$ of file $i$ is given by the maximum of the time it takes to download the segment and the time at which the previous segment is played plus the time to play a segment ($\tau$ seconds). Thus, the play time of segment $q$ of file $i$, $T_i^{(q)}$, can be expressed as
$$T_i^{(q)} = \max\left( T_i^{(q-1)} + \tau,\; D_i^{(q)} \right). \tag{13}$$
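Equation (13) gives a recursive equation, which can yield
$$\begin{aligned} T_i^{(L_i)} &= \max\left( T_i^{(L_i-1)} + \tau,\; D_i^{(L_i)} \right) \\ &= \max\left( T_i^{(L_i-2)} + 2\tau,\; D_i^{(L_i-1)} + \tau,\; D_i^{(L_i)} \right) \\ &= \max\left( d_s + (L_i-1)\tau,\; \max_{z=2}^{L_i+1} \left( D_i^{(z-1)} + (L_i - z + 1)\tau \right) \right). \end{aligned} \tag{14}$$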
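The recursion and its unrolled form can be compared on concrete numbers. The following sketch computes the play times segment by segment and, separately, the play time of the last segment from the unrolled maximum; the function names and the sample download times are hypothetical.

```python
def play_times(download_times, startup_delay, tau):
    """Play time of each segment: the first waits for the startup delay, each later one
    waits for its own download and for the previous segment to finish playing."""
    times = []
    for q, d in enumerate(download_times):
        if q == 0:
            times.append(max(startup_delay, d))
        else:
            times.append(max(times[-1] + tau, d))
    return times

def last_play_time_unrolled(download_times, startup_delay, tau):
    """Play time of the last segment from the unrolled form of the recursion."""
    L = len(download_times)
    candidates = [startup_delay + (L - 1) * tau]
    # Segment q (1-based) contributes its download time plus the playback of later segments.
    candidates += [d + (L - q) * tau for q, d in enumerate(download_times, start=1)]
    return max(candidates)

if __name__ == "__main__":
    downloads = [1.0, 9.0, 10.0, 20.0]  # illustrative segment download completion times
    print(play_times(downloads, startup_delay=3.0, tau=4.0))
    print(last_play_time_unrolled(downloads, startup_delay=3.0, tau=4.0))
```

The unrolled form is what makes the later analysis tractable: it expresses the last play time as a single maximum over terms whose moment generating functions are known individually.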
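Since $D_i^{(q)} = \max_{j \in \mathcal{A}_i} D_{i,j}^{(q)}$ from (11), $T_i^{(L_i)}$ can be written as
$$T_i^{(L_i)} = \max_{z=1}^{L_i+1} \; \max_{j \in \mathcal{A}_i} \; p_{ijz}, \tag{15}$$
where
$$p_{ijz} = \begin{cases} d_s + (L_i - 1)\tau, & z = 1, \\ D_{i,j}^{(z-1)} + (L_i - z + 1)\tau, & 2 \le z \le L_i + 1. \end{cases} \tag{16}$$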
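We next give the moment generating function of $p_{ijz}$, which will be used in the calculations of the QoE metrics in the next sections. Hence, we state the following lemma.
Lemma 2.
The moment generating function for $p_{ijz}$ is given as
$$\mathbb{E}\left[ e^{t p_{ijz}} \right] = \begin{cases} e^{t (d_s + (L_i - 1)\tau)}, & z = 1, \\ e^{t (L_i + 1 - z)\tau}\, Z_{D_{i,j}}^{(z-1)}(t), & 2 \le z \le L_i + 1, \end{cases} \tag{17}$$
where
$$Z_{D_{i,j}}^{(\ell)}(t) = \mathbb{E}\left[ e^{t D_{i,j}^{(\ell)}} \right] = \frac{(1 - \rho_j)\, t\, \left( M_j(t) \right)^{\ell}}{t - \Lambda_j \left( B_j(t) - 1 \right)}. \tag{18}$$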
Proof.
The proof is provided in Appendix C. ∎
Ideally, the last segment should be completed by time $d_s + L_i \tau$. The difference between $T_i^{(L_i)}$ and $d_s + (L_i - 1)\tau$ gives the stall duration. Note that the stalls may occur before any segment. This difference gives the sum of the durations of all the stall periods before any segment. Thus, the stall duration for the request of file $i$, $\Gamma^{(i)}$, is given as
$$\Gamma^{(i)} = T_i^{(L_i)} - d_s - (L_i - 1)\tau. \tag{19}$$
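To see why a single difference captures every stall, the sketch below computes the stall duration in two ways: as the delay of the last segment relative to its ideal start, and as the sum of the individual stalls before each segment. The two agree because the per-segment gaps telescope. Function names and sample values are hypothetical.

```python
def stall_duration(download_times, startup_delay, tau):
    """Total stall: actual start of the last segment minus its ideal start
    startup_delay + (L - 1) * tau."""
    play = None
    for q, d in enumerate(download_times):
        play = max(startup_delay, d) if q == 0 else max(play + tau, d)
    return play - startup_delay - (len(download_times) - 1) * tau

def per_segment_stalls(download_times, startup_delay, tau):
    """Stall observed immediately before each segment starts playing."""
    stalls, play = [], None
    for q, d in enumerate(download_times):
        earliest = startup_delay if q == 0 else play + tau  # start time with no stall
        play = max(earliest, d)
        stalls.append(play - earliest)
    return stalls

if __name__ == "__main__":
    downloads = [1.0, 9.0, 10.0, 20.0]  # illustrative segment download completion times
    print(per_segment_stalls(downloads, startup_delay=3.0, tau=4.0))
    print(stall_duration(downloads, startup_delay=3.0, tau=4.0))
```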
In the next two sections, we will use this stall time to determine the bounds on the mean stall duration and the stall duration tail probability.
V Mean Stall Duration
In this section, we will provide a bound for the first QoE metric, which is the mean stall duration for a file $i$. We will find the bound using probabilistic scheduling, and since probabilistic scheduling is one feasible strategy, the obtained bound is also an upper bound on the mean stall duration of the optimal strategy.
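Using (19), the expected stall time for file $i$ is given as follows
$$\mathbb{E}\left[ \Gamma^{(i)} \right] = \mathbb{E}\left[ T_i^{(L_i)} - d_s - (L_i - 1)\tau \right] = \mathbb{E}\left[ T_i^{(L_i)} \right] - d_s - (L_i - 1)\tau. \tag{20}$$
An exact evaluation of the play time of segment $L_i$ is hard due to the dependencies between the random variables $p_{ijz}$ for different values of $j$ and $z$, where $z \in \{1, 2, \ldots, L_i + 1\}$ and $j \in \mathcal{A}_i$. Hence, we derive an upper bound on the play time of segment $L_i$ as follows. Using Jensen's inequality [39], we have for $t_i > 0$,
$$e^{t_i \mathbb{E}\left[ T_i^{(L_i)} \right]} \le \mathbb{E}\left[ e^{t_i T_i^{(L_i)}} \right]. \tag{21}$$
Thus, finding an upper bound on the moment generating function of $T_i^{(L_i)}$ leads to an upper bound on the mean stall duration. We will now bound the moment generating function of $T_i^{(L_i)}$.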
[TABLE]
where (a) follows from (15), (b) follows by upper bounding by , (c) follows by probabilistic scheduling where , and . We note that the only inequality here is for replacing the maximum by the sum. Since this term will be inside the logarithm for the mean stall latency, the gap between the term and its bound becomes additive rather than multiplicative.
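The two steps used above, Jensen's inequality followed by replacing a maximum with a sum, can be illustrated on generic random variables. The sketch below bounds the mean of a maximum by the logarithm of a sum of moment generating functions. It uses independent shifted exponential samples purely for illustration and is not the paper's exact expression; the function names are hypothetical.

```python
import math
import random

def empirical_mean_of_max(columns):
    """Sample mean of max_j X_j, where columns[j] holds the samples of X_j."""
    rows = list(zip(*columns))
    return sum(max(row) for row in rows) / len(rows)

def mean_of_max_bound(columns, t):
    """(1/t) * log(sum_j E[exp(t * X_j)]) for t > 0: Jensen's inequality moves the mean
    inside the exponential, then the max of nonnegative terms is bounded by their sum."""
    n = len(columns[0])
    mgf_sum = sum(sum(math.exp(t * x) for x in col) / n for col in columns)
    return math.log(mgf_sum) / t

if __name__ == "__main__":
    rng = random.Random(4)
    cols = [[0.5 + rng.expovariate(2.0) for _ in range(2000)] for _ in range(3)]
    print(empirical_mean_of_max(cols))
    for t in (0.2, 0.5, 1.0, 1.5):
        print(t, mean_of_max_bound(cols, t))
```

The bound holds for every admissible $t$ but its tightness varies with $t$, which is why the auxiliary parameters are treated as optimization variables rather than fixed constants.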
Substituting (22) in (21), we have
[TABLE]
Let , where is defined in equation (18). We note that can be simplified using the geometric series formula as follows.
Lemma 3.
[TABLE]
where , is given in (2), and is given in (7).
Proof.
The proof is provided in Appendix D. ∎
Substituting (23) in (20) and some manipulations, the mean stall duration is bounded as follows.
Theorem 1.
The mean stall duration time for file is bounded by
[TABLE]
*for any , ,
.*
Proof.
The proof is provided in Appendix E. ∎
Note that Theorem 1 above holds only in the range of when which reduces to , and . Further, the server utilization must be less than for stability of the system.
We note that for the scenario where the files are downloaded rather than streamed, a metric of interest is the mean download time. This is a special case of our approach when the number of segments of each video is one, or $L_i = 1$. Thus, the mean download time of the file follows as a special case of Theorem 1. We note that the authors of [16, 14] gave an upper bound for the mean file download time using probabilistic scheduling. However, the bound in this paper is different since we use a moment generating function based bound. The two bounds are compared in Section VIII, and the bounds in this paper are shown to outperform those in [16, 14].
VI Stall Duration Tail Probability
The stall duration tail probability of a file $i$ is defined as the probability that the stall duration $\Gamma^{(i)}$ is greater than or equal to $\sigma$. Since evaluating $\Pr(\Gamma^{(i)} \ge \sigma)$ in closed form is hard [29, 28, 12, 16, 14, 15], we derive an upper bound on the stall duration tail probability under Probabilistic Scheduling as follows.
[TABLE]
where follows from (20) and . Then,
[TABLE]
where (b) follows from (15), (c) follows as both the max over $z$ and the max over $\mathcal{A}_i$ are over discrete indices that do not depend on each other, so they can be exchanged, (d) follows by replacing the max over $j \in \mathcal{A}_i$ by the sum over $j \in \mathcal{A}_i$, and (e) follows from probabilistic scheduling. Using Markov's inequality, we get
[TABLE]
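The step from a moment generating function to a tail bound can be seen on a single shifted exponential variable, where the exact tail is known. The sketch below applies Markov's inequality to $e^{tX}$ and compares the resulting bound with the exact tail; it illustrates the technique only and is not the paper's stall duration bound. Function names and values are hypothetical.

```python
import math

def exact_tail(a, alpha, beta):
    """Pr(X >= a) for a shifted exponential X with rate alpha and shift beta."""
    return 1.0 if a <= beta else math.exp(-alpha * (a - beta))

def mgf_tail_bound(a, alpha, beta, t):
    """Markov's inequality on exp(t * X): Pr(X >= a) <= exp(-t * a) * E[exp(t * X)],
    valid for 0 < t < alpha."""
    if not 0 < t < alpha:
        raise ValueError("t must lie in (0, alpha)")
    return math.exp(-t * a) * alpha / (alpha - t) * math.exp(beta * t)

def best_tail_bound(a, alpha, beta, grid=200):
    """Tightest bound over an evenly spaced grid of t values in (0, alpha)."""
    return min(mgf_tail_bound(a, alpha, beta, alpha * k / grid) for k in range(1, grid))

if __name__ == "__main__":
    a, alpha, beta = 3.0, 2.0, 0.5  # illustrative values
    print(exact_tail(a, alpha, beta))
    print(mgf_tail_bound(a, alpha, beta, t=0.1))
    print(best_tail_bound(a, alpha, beta))
```

Even the best choice of $t$ leaves a gap to the exact tail, but the bound decays exponentially in the threshold, which is the property the tail probability analysis relies on.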
We further simplify to get
[TABLE]
where (f) follows from (54). Substituting (29) in (27), we get the stall duration tail probability as described in the following theorem (details are provided in Appendix F).
Theorem 2.
The stall distribution tail probability for video file is bounded by
[TABLE]
*for any , ,
, and is given by (24).*
We note that for the scenario where the files are downloaded rather than streamed, a metric of interest is the latency tail probability, which is the probability that the file download latency is greater than $\sigma$. This is a special case of our approach when the number of segments of each video is one, or $L_i = 1$. Thus, the latency tail probability of the file follows as a special case of Theorem 2. In this special case, the result reduces to that in [13].
VII Optimization Problem Formulation and Proposed Algorithm
VII-A Problem Formulation
Let , , and . Note that the values of ’s used for mean stall duration and the stall duration tail probability can be different and the parameters and indicate these parameters for the two cases, respectively. We wish to minimize the two proposed QoE metrics over the choice of scheduling and access decisions. Since this is a multi-objective optimization, the objective can be modeled as a convex combination of the two QoE metrics.
Let $\overline{\lambda} = \sum_i \lambda_i$ be the total arrival rate. Then, $\lambda_i / \overline{\lambda}$ is the ratio of video $i$ requests. The first objective is the minimization of the mean stall duration, averaged over all the file requests, and is given as $\sum_i \frac{\lambda_i}{\overline{\lambda}} \mathbb{E}[\Gamma^{(i)}]$. The second objective is the minimization of the stall duration tail probability, averaged over all the file requests, and is given as $\sum_i \frac{\lambda_i}{\overline{\lambda}} \Pr(\Gamma^{(i)} \ge \sigma)$. Using the expressions for the mean stall duration and the stall duration tail probability in Sections V and VI, respectively, the optimization of a convex combination of the two QoE metrics can be formulated as follows.
[TABLE]
[TABLE]
Here, is a trade-off factor that determines the relative significance of mean and tail probability of the stall durations in the minimization problem. Varying to , the solution for (31) spans the solutions that minimize the mean stall duration to ones that minimize the stall duration tail probability. Note that constraint (40) gives the load intensity of server . Constraint (41) gives the aggregate arrival rate for each node for the given probabilistic scheduling probabilities and arrival rates . Constraints (43)-(44) guarantee that the scheduling probabilities are feasible. Constraints (45)-(48) ensure that exist for each and . Finally, Constraints (49)-(50) ensure that the moment generating function given in (18) exists. We note that the optimization over helps decrease the objective function and gives significant flexibility over choosing the lowest-queue servers for accessing the files. The placement of the video files helps separate the highly accessed files on different servers, thus reducing the objective. Finally, the optimization over the auxiliary variables gives a tighter bound on the objective function. We note that the QoE for file is weighted by the arrival rate in the formulation. However, general weights can be easily incorporated for weighted fairness or differentiated services.
Note that the proposed optimization problem is a mixed-integer non-convex optimization as we have the placement over servers and the constraints (49) and (50) are non-convex in . We also note the placement may be decided for multiple aggregation VMs simultaneously and may not be a parameter for a single aggregation VM. In that case, the proposed algorithm can still be used without an optimization over the placement of video files. In the next subsection, we will describe the proposed algorithm.
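To make the structure of the objective concrete, the sketch below evaluates an arrival-rate-weighted convex combination of the two QoE metrics. It is only an illustration under stated assumptions: the name `theta` for the trade-off factor, the convention that `theta = 1` weights only the mean stall duration, and the per-file bound values are all hypothetical, and the per-file bounds are taken as given numbers rather than computed from the expressions in Sections V and VI.

```python
def combined_objective(theta, arrival_rates, mean_stall_bounds, tail_prob_bounds):
    """Arrival-rate-weighted convex combination of two per-file QoE bounds.

    theta = 1 keeps only the mean stall duration and theta = 0 keeps only the
    tail probability (an assumed convention for this sketch).
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    total_rate = sum(arrival_rates)
    weights = [lam / total_rate for lam in arrival_rates]  # share of requests per file
    mean_term = sum(w * m for w, m in zip(weights, mean_stall_bounds))
    tail_term = sum(w * p for w, p in zip(weights, tail_prob_bounds))
    return theta * mean_term + (1.0 - theta) * tail_term

# Hypothetical per-file arrival rates and bound values.
rates = [0.002, 0.001, 0.001]
mean_bounds = [10.0, 20.0, 40.0]   # seconds
tail_bounds = [0.1, 0.2, 0.4]      # probabilities
for theta in (0.0, 0.5, 1.0):
    print(theta, combined_objective(theta, rates, mean_bounds, tail_bounds))
```

Sweeping `theta` between its two extremes traces out the trade-off between the two metrics for a fixed set of decisions.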
VII-B Proposed Algorithm
The joint mean-tail stall duration optimization problem given in (31)-(51) is optimized over three sets of variables: scheduling probabilities , auxiliary parameters , and chunk placement . Since the problem is non-convex, we propose an iterative algorithm to solve the problem. The proposed algorithm divides the problem into three subproblems that optimize one variable fixing the remaining two. The three sub-problems are labeled as (i) Access Optimization optimizes for given and , (ii) Auxiliary Variables Optimization optimizes for given and , and (iii) Placement Optimization optimizes for given and . This algorithm is summarized as follows.
1. **Initialization:** Initialize , , and in the feasible set.
2. **While the objective has not converged:**
   - (a) Run Access Optimization using current values of and to get new values of
   - (b) Run Auxiliary Variables Optimization using current values of and to get new values of
   - (c) Run Placement Optimization using current values of and to get new values of and .
We first initialize , and such that the choice is feasible for the problem. Then, we do alternating minimization over the three sub-problems defined above. We will describe the three sub-problems along with the proposed solutions for the sub-problems in Appendix G. Each of the three sub-problems is solved by the iNner cOnVex Approximation (NOVA) algorithm proposed in [17], and is guaranteed to converge to a stationary point. Since each sub-problem produces a non-increasing sequence of objective values and the overall problem is bounded from below, we have the following result.
**Theorem 3.**
The proposed algorithm converges to a stationary point.
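The alternating structure of the algorithm can be sketched on a toy problem. The code below is not the paper's sub-problem solvers: it cycles through three block updates on a small quadratic stand-in objective whose per-block minimizers are known in closed form, and stops once the objective no longer decreases. The objective, the update rules, the dictionary keys, and the tolerance are all assumptions for illustration (in particular, the placement block is continuous here, whereas it is discrete in the paper).

```python
def alternating_minimization(objective, block_updates, state, tol=1e-9, max_iters=1000):
    """Cycle through the block updates until the objective stops decreasing."""
    history = [objective(state)]
    for _ in range(max_iters):
        for update in block_updates:
            state = update(state)  # minimize over one block, others held fixed
        history.append(objective(state))
        if history[-2] - history[-1] < tol:
            break
    return state, history

# Toy stand-in objective over three scalar "blocks" (hypothetical).
def objective(s):
    a, b, c = s["access"], s["aux"], s["placement"]
    return (a - b) ** 2 + (b - c) ** 2 + (a - 1.0) ** 2 + (c + 1.0) ** 2

# Exact minimizers of the toy objective over one block at a time.
def update_access(s):
    return {**s, "access": (s["aux"] + 1.0) / 2.0}

def update_aux(s):
    return {**s, "aux": (s["access"] + s["placement"]) / 2.0}

def update_placement(s):
    return {**s, "placement": (s["aux"] - 1.0) / 2.0}

start = {"access": 3.0, "aux": -2.0, "placement": 4.0}
final, history = alternating_minimization(
    objective, [update_access, update_aux, update_placement], start)
print(final, history[-1], len(history) - 1)
```

Because every block update can only lower the objective and the objective is bounded below, the recorded objective values form a non-increasing convergent sequence, which mirrors the argument behind Theorem 3.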
VIII Numerical Results
In this section, we evaluate our proposed algorithm for optimization of mean and tail probability of stall duration and show the effect of the trade-off parameter . We first study the two extremes where either only the mean stall duration objective or only the stall duration tail probability is considered. Then, we show the tradeoff between the two QoE metrics based on the trade-off parameter .
VIII-A Numerical Setup
We simulate our algorithm in a distributed storage system of distributed nodes, where each video file uses an erasure code. These parameters were chosen in [14] in the experiments using the Tahoe testbed. Further, erasure code is used in HDFS-RAID in Facebook [40] and Microsoft [6]. Unless otherwise explicitly stated, we consider files, whose sizes are generated based on a Pareto distribution [41] with shape factor of 2 and scale of 300. We note that the Pareto distribution is considered as it has been widely used in existing literature [42] to model video files and file-size distribution over networks. We also assume that the chunk service time follows a shifted-exponential distribution with rate and shift , whose values are shown in Table I, which are generated at random and kept fixed for the experiments (recall that this distribution has been validated in experiments on realistic systems [14, 15]). Unless explicitly stated, the arrival rate for the first files is while for the next files is set to be . Chunk size is set to be equal to s. When generating video files, the video file sizes are rounded up to a multiple of sec. We note that a high load scenario is considered for the numerical results. In practice, the load will not be that high. However, higher load helps demonstrate the significant improvement in performance as compared to the lightly loaded scenarios where there are almost no stalls. In order to initialize our algorithm, we use a random placement of files on all the servers. Further, we set on the placed servers with and . However, these choices of and may not be feasible. Thus, we modify the initialization of to be the closest feasible solution in norm given the above values of and . We compare our proposed approach with five strategies:
1. *Random Placement, Optimized Access (RP-OA):* In this strategy, the placement is chosen at random where any out of servers are chosen for each file, where each choice is equally likely. Given the random placement, the variables and are optimized using the Algorithm in Section VII-B, where -optimization is not performed.
2. *Optimized Placement, Projected Equal Access (OP-PEA):* The strategy utilizes , and as mentioned in the setup. Then, alternating optimization over placement and is performed using the proposed algorithm.
3. *Random Placement, Projected Equal Access (RP-PEA):* In this strategy, the placement is chosen at random where any out of servers are chosen for each file, where each choice is equally likely. Further, we set on the placed servers with and . We then modify the initialization of to be the closest feasible solution in norm given the above values of and . Finally, the objective is optimized over using Algorithm 2.
4. *OP-PSP (Optimized Placement-Projected Service-Rate Proportional Allocation) Policy:* The joint request scheduler chooses the access probabilities to be proportional to the service rates of the storage nodes, i.e., . This policy assigns servers proportional to their service rates. These access probabilities are projected toward the feasible region for uniformly randomly placed files to ensure stability of the storage system. With these fixed access probabilities, the weighted mean stall duration and stall duration tail probability are optimized over the , and placement .
5. *RP-PSP (Random Placement-PSP) Policy:* As compared to the OP-PSP Policy, the chunks are placed uniformly at random. The weighted mean stall duration and stall duration tail probability are optimized over the choice of auxiliary variables .
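The workload described in the setup can be sketched as follows. This is a minimal illustration, not the authors' simulation code: it draws Pareto-distributed video lengths (shape 2, scale 300), rounds them up to a multiple of the chunk length, samples a shifted-exponential chunk service time, and computes service-rate-proportional access probabilities before any projection onto the feasible region. The chunk length of 4 s, the service rates, the placement, and the value of `k` are hypothetical, and the normalization of the access probabilities so that they sum to `k` is an assumption about the probabilistic scheduling constraints.

```python
import math
import random

def generate_video_lengths(num_files, shape, scale, chunk_seconds, rng):
    """Pareto-distributed lengths (seconds), rounded up to whole chunks."""
    lengths = []
    for _ in range(num_files):
        raw = scale * rng.paretovariate(shape)  # Pareto with minimum value `scale`
        lengths.append(math.ceil(raw / chunk_seconds) * chunk_seconds)
    return lengths

def sample_service_time(rate, shift, rng):
    """One draw from a shifted-exponential distribution: shift + Exp(rate)."""
    return shift + rng.expovariate(rate)

def psp_access_probabilities(service_rates, placed_servers, k):
    """Access probabilities proportional to service rates, summing to k.

    No projection is applied here, so a value above 1 would still need the
    projection step mentioned in the text.
    """
    total = sum(service_rates[j] for j in placed_servers)
    return {j: k * service_rates[j] / total for j in placed_servers}

rng = random.Random(0)
lengths = generate_video_lengths(5, shape=2.0, scale=300.0, chunk_seconds=4, rng=rng)
rates = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10]  # hypothetical per-server service rates
probs = psp_access_probabilities(rates, placed_servers=[0, 2, 3, 5], k=2)
print(lengths)
print(probs)
print(sample_service_time(rate=0.1, shift=5.0, rng=rng))
```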
VIII-B Mean Download Time Comparison
We note that when the number of segments, , the mean stall duration is the same as the mean download time of the file. Further, the bounds in this paper are different from those given in [16, 14] even though both works use probabilistic scheduling. We will now compare our proposed upper bound on the download time of a file with the upper bound given in [16, 14]. The comparison can be seen in Figure 4, where the above service time distributions are used at the servers. We observe that our bound is tighter for all values of arrival rate (), and the relative improvement increases with the arrival rate. For instance, our bound is lower than that given in [16, 14] when the arrival rate equals .
VIII-C Mean Stall Duration optimization
In this subsection, we focus only on minimizing the mean stall duration of all files by setting , i.e., stall duration tail probability is not considered.
Convergence of the Proposed Algorithm
Figure 5 shows the convergence of our proposed algorithm, which alternately optimizes the mean stall duration of all files over scheduling probabilities , auxiliary variables , and placement . We notice that for video files of size 600 sec with storage nodes, the mean stall duration converges in fewer than iterations.
Effect of Arrival Rate and Video Length
Figure 6 shows the effect of different video arrival rates on the mean stall duration for videos of different lengths. The video lengths follow the Pareto distribution described above. We compare our proposed algorithm with the five baseline policies and see that the proposed algorithm outperforms all baseline strategies for the QoE metric of mean stall duration. Thus, both access and placement of files are important for the reduction of mean stall duration. Further, we see that the mean stall duration increases with arrival rates, as expected. Since the mean stall duration is more significant at high arrival rates, we notice a significant improvement in mean stall duration of about 60% (from approximately 700s to about 250s) at the highest arrival rate in Figure 6 as compared to the random placement and projected equal access policy. In Figure 16, Appendix J, we study the effect of increasing the arrival rate when the video sizes are equal with mean of 600 sec.
VIII-D Stall Duration Tail Probability Optimization
In this subsection, we consider minimizing the stall duration tail probability, , by setting in (31).
Decrease of Stall Duration Tail Probability with
Figure 7 shows the decay of weighted stall duration tail probability with respect to (in seconds) for the proposed and the baseline strategies. To magnify the small differences, we plot the y-axis on a logarithmic scale. We observe that the proposed algorithm gives orders-of-magnitude improvement in the stall duration tail probabilities as compared to the baseline strategies.
Effect of the number of video files
Figure 8 demonstrates the effect of increasing the number of video files (from files to files whose sizes are defined based on Pareto) on the stall duration tail probability. The stall duration tail probability increases with the number of video files, and the proposed algorithm manages to significantly improve the QoE as compared to the considered baselines.
VIII-E Tradeoff between mean stall duration and stall duration tail probability
If the mean stall duration decreases, intuitively the stall duration tail probability also reduces. Thus, a question arises whether the optimal point for decreasing the mean stall duration and the stall duration tail probability is the same. We answer the question in the negative since for of equal sizes of length 300 sec, we find that at the values of (, ) that optimize the mean stall duration, the stall duration tail probability is 12 times higher than the optimal stall duration tail probability. Similarly, the optimal mean stall duration is 30% lower than the mean stall duration at the value of (, ) that optimizes the stall duration tail probability. Thus, an efficient tradeoff point between the QoE metrics can be chosen based on the point on the curve that is appropriate for the clients.
IX Conclusions
This paper considers video streaming over the cloud where the content is erasure-coded on the distributed servers. Two quality of experience (QoE) metrics related to the stall duration, the mean stall duration and the stall duration tail probability, are characterized with upper bounds. The download and play times of each video segment are characterized to evaluate the QoE metrics. An optimization problem that optimizes the convex combination of the two QoE metrics for the choice of placement and access of contents from the servers is formulated. An efficient algorithm is proposed to solve the optimization problem, and the numerical results show the improved performance of the algorithm as compared to the considered baselines. Some possible future directions are provided in Appendix L.
Appendix A Key notations used in this paper
Appendix B Proof of Lemma 1
[TABLE]
Appendix C Proof of Lemma 2
This follows by substituting in (10) and is given by (7) and is given by (2). This expression holds when and , since the moment generating function does not exist if the above does not hold.
Appendix D Proof of Lemma 3
[TABLE]
Appendix E Proof of Theorem 1
We first find an upper bound on as follows.
[TABLE]
where (d) follows by bounding the maximum by the sum, (e) follows from (17), and (f) follows by substituting .
Further, substituting the bounds (54) and (23) in (20), the mean stall duration is bounded as follows.
[TABLE]
Appendix F Proof of Theorem 2
Substituting (29) in (27), we get
[TABLE]
where (g) follows from (54) and is given by (24).
Appendix G Description of the Algorithms for the Three Sub-Problems
G-A Access Optimization
Given the placement and the auxiliary variables, this subproblem can be written as follows.
**Input: , **
Objective: min
s.t. (40), (41), (42), (43), (49), (50)
var.
To solve this sub-problem, we use the iNner cOnVex Approximation (NOVA) algorithm proposed in [17]. The key idea of this algorithm is that the non-convex objective function is replaced by suitable convex approximations for which convergence to a stationary solution of the original non-convex optimization is established. NOVA solves the approximated function efficiently and maintains feasibility in each iteration. The objective function can be approximated by a convex one (e.g., proximal gradient-like approximation) such that the first order properties are preserved [17], and this convex approximation can be used in the NOVA algorithm.
Let be the convex approximation at iterate to the original non-convex problem , where is given by (31). Then, a valid choice of is the first order approximation of , e.g., (proximal) gradient-like approximation, i.e.,
[TABLE]
where is a regularization parameter. Note that all the constraints (40), (41), (42), (43), (49), and (50) are linear in . The NOVA Algorithm for optimizing is described in Algorithm 1. Using the convex approximation , the minimization steps in Algorithm 1 are convex, with linear constraints and thus can be solved using a projected gradient descent algorithm. A step-size () is also used in the update of the iterate . Note that the iterates generated by the algorithm are all feasible for the original problem and, further, convergence is guaranteed, as shown in [17] and described in the following lemma.
**Lemma 4.**
For fixed placement and , the optimization of our problem over generates a sequence of decreasing objective values and therefore is guaranteed to converge to a stationary point.
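The proximal gradient-like convex approximation and the step-size update described above can be sketched as follows. This is a simplified illustration in the spirit of successive convex approximation, not the NOVA algorithm of [17] as used in the paper: the feasible set is a simple box rather than the linear constraints of the sub-problem, the step size is constant, and the non-convex test function, `tau`, and `gamma` are hypothetical. Over a box, the minimizer of the surrogate (linearized objective plus a quadratic proximal term) is a projected gradient step.

```python
import math

def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(hi, max(lo, v)) for v in x]

def sca_minimize(f, grad, x0, tau=0.05, gamma=0.5, iters=200):
    """Successive convex approximation with a proximal gradient-like surrogate.

    At iterate x the surrogate is
        grad(x) . (y - x) + ||y - x||^2 / (2 * tau),
    whose minimizer over the box is the projection of x - tau * grad(x).
    The next iterate moves a fraction gamma toward that minimizer, so every
    iterate stays feasible.
    """
    x = project_box(x0)
    values = [f(x)]
    for _ in range(iters):
        g = grad(x)
        x_hat = project_box([xi - tau * gi for xi, gi in zip(x, g)])
        x = [xi + gamma * (xh - xi) for xi, xh in zip(x, x_hat)]
        values.append(f(x))
    return x, values

# Hypothetical smooth non-convex objective on [0, 1]^2.
def f(x):
    return sum(math.sin(3.0 * v) + v * v for v in x)

def grad(x):
    return [3.0 * math.cos(3.0 * v) + 2.0 * v for v in x]

x_final, values = sca_minimize(f, grad, [0.3, 0.9])
print(x_final, values[0], values[-1])
```

With a small enough `tau`, each step lowers the objective, which is the property that Lemma 4 relies on for the access sub-problem.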
G-B Auxiliary Variables Optimization
Given the placement and the access variables, this subproblem can be written as follows.
**Input: , **
Objective: min
s.t. (45), (46), (47), (48), (49), (50)
var.
Similar to Access Optimization, this optimization can be solved using the NOVA algorithm. The constraints (45) and (46) are linear in . The next two lemmas show that the constraints (47), (48), (49), and (50) are convex in respectively.
**Lemma 5.**
The constraints (47) and (48) are convex with respect to .
Proof.
The proof is provided in Appendix H. ∎
**Lemma 6.**
The constraints (49) and (50) are convex with respect to .
Proof.
The proof is provided in Appendix I. ∎
Algorithm 2 shows the procedure used to solve for . Let be the convex approximation at iterate to the original non-convex problem , where is given by (31), assuming other parameters constant. Then, a valid choice of is the first order approximation of , i.e.,
[TABLE]
where is a regularization parameter. The detailed steps can be seen in Algorithm 2. Since all the constraints (45), (48), and (49) have been shown to be convex in , the optimization problem in Step 1 of Algorithm 2 can be solved by the standard projected gradient descent algorithm.
G-C Placement Optimization
Given and , this subproblem finds a permutation of the placement of files on the different servers. Let the given be denoted as and the placement corresponding to this access be . We find a permutation of the servers for each file , and call it is a permutation of the servers from to . Further, having the mapping of the servers for each file, the new access probabilities are . Having these access probabilities, the new placement of the files will be . We note that the constraints (42), (43), and (44) for the access from the modified placement of the servers will already be satisfied. The Placement Optimization subproblem is to find the optimal permutations . The problem can be formally written as follows.
Objective: (31)
s.t. (40), (41), (49), , is a permutation on
var. and
We note that the optimization problem is to find permutations and is a discrete optimization problem. We first consider optimizing only over one of the permutations . Let be written as an indicator function which is if and zero otherwise. Then, the new while for other files , remains the same. With the new values of , the only optimization variables are . The constraints for are and . We note that this is a non-linear bipartite matching problem [43]. All the permutations taken together result in discrete optimization variables that we wish to optimize.
In general, we have the constraints and for all , , where binary for each are the decision variables. In order to solve the non-linear problem with integer constraints, we use the NOVA algorithm, where a term is added in the objective for each constraint (to make the problem smooth), where is a large number and is large enough to force the solutions to be binary. The NOVA algorithm guarantees convergence for any given value of and thus for large enough , we will obtain the stationary point that has integer constraints.
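The indicator-variable view of a server permutation can be illustrated with a short sketch. It builds the binary matrix associated with a permutation, checks the row-sum and column-sum constraints, and applies the permutation to one file's access probabilities. The direction of the mapping (the probability of old server `j` moves to new server `perm[j]`), the function names, and the example values are assumptions for illustration; the penalty term added to the objective is not reproduced here.

```python
def permutation_matrix(perm):
    """Binary matrix x with x[j][k] = 1 exactly when server j is mapped to k."""
    m = len(perm)
    if sorted(perm) != list(range(m)):
        raise ValueError("perm must be a permutation of 0..m-1")
    return [[1 if perm[j] == k else 0 for k in range(m)] for j in range(m)]

def is_permutation_matrix(x):
    """Check binary entries with every row and every column summing to one."""
    m = len(x)
    binary = all(v in (0, 1) for row in x for v in row)
    rows = all(sum(row) == 1 for row in x)
    cols = all(sum(x[j][k] for j in range(m)) == 1 for k in range(m))
    return binary and rows and cols

def permute_access(pi_row, perm):
    """Move the access probability of old server j to new server perm[j]."""
    new_row = [0.0] * len(pi_row)
    for j, k in enumerate(perm):
        new_row[k] = pi_row[j]
    return new_row

perm = [2, 0, 1]                 # hypothetical permutation of three servers
x = permutation_matrix(perm)
print(is_permutation_matrix(x))
print(permute_access([0.5, 0.5, 0.0], perm))
```

Since a permutation only relabels servers, the permuted probabilities keep the same values and the same sum, which is why the scheduling feasibility constraints remain satisfied after the placement changes.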
Appendix H Proof of Lemma 5
The constraints (47) and (48) are separable for each and and thus it is enough to prove convexity of . Thus, it is enough to prove that .
The first derivative of is given as
[TABLE]
Differentiating it again, we get the second derivative as follows.
[TABLE]
Since , given in (60) is non-negative, which proves the Lemma.
Appendix I Proof of Lemma 6
The constraints (49) and (50) are separable for each and , and thus it is enough to prove convexity of for . Thus, it is enough to prove that for . We further note that it is enough to prove that , where . Hence, the first derivative of is given as
[TABLE]
Note that since . Differentiating again, we get the second derivative as follows.
[TABLE]
Since , given in (62) is non-negative, which proves the Lemma.
Appendix J Additional Simulation Figures
In this section, in addition to the variations studied earlier, we will explore the effects of changing some other system parameters, i.e., the number of servers, the number of video files, the increase of video request arrival rates, and the code choice on the stall durations.
**Effect of number of servers:** Figure 10 depicts the mean stall duration for increasing number of servers (, , , ). We note that the mean stall duration decreases as the number of servers increases.
**Effect of encoding parameters:** Figure 11 depicts the weighted stall duration tail probability for varying the number of files, and for different choices of code parameters. We first note that the weighted stall duration tail probability is higher for a larger number of files. Further, we note that the code with larger for the same value of performs better. This is because a larger value of gives more choice for the selection of servers. Thus, performs better than and performs better than . Among and , the additional redundancy is . With the same number of parity symbols, it is better to have a larger value of since smaller chunks are obtained from each server, helping stall durations. Since the replication has , this analysis thus shows that an erasure code with the same redundancy can help achieve better stall durations.
**Performance with Repetition Coding:** Figure 12 shows the effect of different video arrival rates on the mean stall duration for different-size video length when each file uses erasure-code (which is triple-replication). We compare our proposed algorithm with the five baseline policies and see that the proposed algorithm outperforms all baseline strategies for the QoE metric of mean stall duration. Thus, both access and placement of files are important for reducing the mean stall duration. We see that the mean stall duration of all approaches increases with arrival rates. However, since the mean stall duration is more significant at high arrival rates, we see the significant improvement in the mean stall duration of our approach as compared to the considered baselines.
**Effect of Arrival Rates:** Figure 14 demonstrates the effect of increasing workload, obtained by varying the arrival rates of the video files from to , where is the base arrival rate, on the stall duration tail probability for video lengths generated based on Pareto distribution defined above. We notice a significant improvement of the QoE metric with the proposed strategy as compared to the baselines. At the arrival rate of , the proposed strategy reduces the stall duration tail probability by about 100% as compared to the random placement and projected equal access policy.
**Convergence of Stall Duration Tail Probability:** Figure 15 demonstrates the convergence of our proposed algorithm for different values of . Considering files of length 300s each with storage nodes, the stall duration tail probability converges in fewer than iterations.
**Effect of the Number of Video Files:** Figure 13 demonstrates the impact of varying the number of video files from files to files on the mean stall duration, where the video lengths are generated according to Pareto distribution with the same parameter defined earlier (scale of 300, and shape of 2). We note that the proposed optimization strategy effectively reduces the mean stall duration and outperforms the considered baseline strategies. Thus, joint optimization over all three variables , , and helps reduce the mean stall duration significantly.
Appendix K Extension to more streams between the server and the edge router
In this section, we investigate extending the proposed approach to the case when there are parallel streams from each server to the edge router. Multiple streams can help obtain video files in parallel, so that one file need not wait behind another. We label the streams from server as (graphically depicted in Figure 17). The analysis in this paper considers only one stream between the server and the edge router. We now show how the analysis can be adapted when there are multiple streams. We first note that the scheduling needs to decide not only the server but also the parallel stream . We assume that each parallel stream is equally likely to be chosen. Further, the multiple streams are obtained through equal bandwidth splits, and thus the service time parameters would be different for streams as compared to the server. For instance, the service rate would be a factor of of the service rate from the server due to the bandwidth split. Thus, the probability of choosing server and stream is
[TABLE]
where is the probability of choosing server . Using this, we note that the analysis of download time from a server can be modified to download time from a stream of a server and the steps can be directly extended. The ordered statistics can use the above probabilistic scheduling to choose a stream of a server and thus the entire analysis can be easily extended.
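A minimal sketch of this extension is given below, assuming that each server's bandwidth is split equally over its streams. Under that assumption the probability of choosing a particular stream is the server's access probability divided by the number of streams, and each stream's service rate is the server's rate divided by the same number. The function names, the number of streams, and the example values are hypothetical.

```python
def stream_probabilities(server_probs, num_streams):
    """Split each server's access probability equally over its parallel streams."""
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")
    return {
        (j, v): p / num_streams
        for j, p in enumerate(server_probs)
        for v in range(num_streams)
    }

def stream_service_rates(server_rates, num_streams):
    """Per-stream service rate under an equal bandwidth split (an assumption)."""
    return [rate / num_streams for rate in server_rates]

server_probs = [0.5, 0.25, 0.25]   # hypothetical access probabilities for one file
q = stream_probabilities(server_probs, num_streams=2)
print(q)
print(stream_service_rates([0.10, 0.12, 0.08], num_streams=2))
```

Summing the stream probabilities over the streams of a server recovers that server's access probability, so the scheduling constraints carry over unchanged.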
Since the optimization also has the same parameters, we show an improvement of the mean stall duration with the number of parallel streams in Fig. 18. The choice of the number of streams can be determined by the practical limitations (e.g., number of ports possible at the server). A more detailed analysis of the parallel streams, exploiting the flexibility of splitting bandwidth among the different streams and of choosing one of the multiple parallel streams for each video, is being considered by the PI in [44, 45].
Appendix L Future Directions
A server does not need to serve different video requests one after the other. It may be better to serve video segments out of order from a queue thus helping stall durations since the later video requests do not have to wait for finishing chunks of earlier requests which have later deadlines. Exploiting this flexibility is an open problem.
Most edge routers would have a cache capacity, where certain segments of video files can be stored to improve the QoEs. Analyzing QoEs and finding efficient caching mechanisms for video streaming over cloud is an open problem (see Appendix M for more details). Further, implementing the ideas in this paper over a real cloud computing environment is left as future work. This paper also does not consider the last hop, and incorporating that is left as future work (see Appendix N for more details).
We note that the current video streaming algorithms use adaptive bit-rate (ABR) strategies to change the video qualities of segments within a video [8, 9]. One of the strategies looks at the buffer usage at the client to determine the quality of the next segment [8]. Incorporating efficient ABR streaming algorithms is an interesting direction for future work. The main challenge in this extension is to incorporate the client behavior, which makes the arrival process non-memoryless and thus makes the analysis complex. Finally, considering the decoding time by combining data in the calculations is left as future work.
Appendix M Impact Of Caching
So far, our analysis did not account for caching. In this Appendix, we present how our model can be extended to account for the impact of caching. Caching content at the network edge, closer to the customers, can further help reduce the stall duration and thus improve the QoE. However, caching the video content has to address a number of crucial challenges that differ from caching of web objects; see for instance [46] and the references therein for a detailed treatment of this aspect.
There are two methods for caching the video files. The first involves caching the complete video file (all video chunks of file ) at edge routers. The second method involves caching partial chunks, i.e., , where , for video file . Most of the current caching schemes cache entire files (for example, hot files). Our analysis can accommodate both of these methods. In the first method, the video file is entirely cached, and is thus not requested from the servers. This is equivalent to changing the arrival rate of these files to zero, i.e., . In the second method, only the later are needed from the servers. This can be easily incorporated by requesting the video of length , while the first chunk can wait for an additional time which can be accounted by adding in the startup delay for this file.
Mathematically, we can show that for , where , the random download time of the remaining segments from server is given as
[TABLE]
Since video file consists of segments stored at server , the total service time for video file , denoted by , is given as
[TABLE]
Hence, the service time of the video files at server is given by
[TABLE]
where is the total arrival rate at server . Also, we can show that the MGF of the service time for all video files from server is given by
[TABLE]
Further, the current load intensity at server , , is as follows
[TABLE]
Similar to the previous analysis, since the arrival is Poisson and the service time is shifted-exponentially distributed, the MGF of the waiting time at queue server can be calculated using the Pollaczek-Khinchine formula, i.e.,
[TABLE]
From the MGF of and the service time, the MGF of the download time of segment from server for file is then
[TABLE]
Having characterized the download time for chunk , we can then determine the stall durations and evaluate the QoE metrics as in Equations (25) and (30). The details are omitted here as they can easily follow.
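For readers who want to check the Pollaczek-Khinchine step numerically, the sketch below evaluates the waiting-time MGF of an M/G/1 queue, E[exp(tW)] = (1 - rho) t / (t - lambda (B(t) - 1)), where B(t) is the service-time MGF. It is a simplified single-class illustration: the service time is one shifted-exponential variable rather than the per-file mixture over segments used in the paper, and the arrival rate `lam`, rate `alpha`, and shift `beta` are hypothetical values. The derivative of the MGF at zero is compared with the classical mean waiting time lambda E[S^2] / (2 (1 - rho)).

```python
import math

def service_mgf(t, alpha, beta):
    """MGF of a shifted-exponential service time S = beta + Exp(alpha)."""
    if t >= alpha:
        raise ValueError("service MGF does not exist for t >= alpha")
    return alpha / (alpha - t) * math.exp(beta * t)

def waiting_mgf(t, lam, alpha, beta):
    """Pollaczek-Khinchine MGF of the M/G/1 waiting time W."""
    rho = lam * (beta + 1.0 / alpha)  # load intensity lam * E[S]
    if rho >= 1.0:
        raise ValueError("queue is unstable (rho >= 1)")
    if t == 0.0:
        return 1.0
    denom = t - lam * (service_mgf(t, alpha, beta) - 1.0)
    if denom * t <= 0.0:
        raise ValueError("waiting-time MGF does not exist at this t")
    return (1.0 - rho) * t / denom

def mean_waiting_time(lam, alpha, beta):
    """Classical P-K mean: lam * E[S^2] / (2 * (1 - rho))."""
    rho = lam * (beta + 1.0 / alpha)
    second_moment = beta ** 2 + 2.0 * beta / alpha + 2.0 / alpha ** 2
    return lam * second_moment / (2.0 * (1.0 - rho))

lam, alpha, beta = 0.01, 0.1, 5.0  # hypothetical arrival rate, service rate, shift
h = 1e-4
numeric_mean = (waiting_mgf(h, lam, alpha, beta) - waiting_mgf(-h, lam, alpha, beta)) / (2 * h)
print(numeric_mean, mean_waiting_time(lam, alpha, beta))
```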
Appendix N End-to-End Analysis
In this Appendix, we show how our analysis can be extended to consider the last hop from the edge-router to the user. If the last hop is considered, the download time of the chunk for video file , if requested from server can be written as follows
[TABLE]
where is the waiting time in the queue of server , is the service time for the chunk , is the chunk size in seconds, is the bit-rate for user , and is the average bandwidth when downloading chunk . Thus, as long as can be bounded, this is the additional stall duration (or additional startup delay). In most wired setups, the capacity for the last hop may not be a bottleneck, and thus this term is negligible and does not vary significantly with . Even for wireless networks in homes, the average bandwidth numbers are much higher than the video rate, and thus this additional term may not be a bottleneck. Thus, the analysis can be easily extended to the last hop. Since the last hop is dependent on the user and the cloud provider wishes to optimize the system such that it does the best delivery in the part controlled by the provider, we did not explicitly consider the last hop. However, as long as the last hop capacity is higher than the data rate of the video, the last hop does not affect the analysis except for a small additional delay.
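The size of the last-hop term can be illustrated with a small calculation. The sketch below assumes that the additional delay equals the chunk's size in bits (chunk length in seconds times the video bit-rate) divided by the average last-hop bandwidth, which is one reading of the quantities listed above and not a reproduction of the omitted equation. All names and numbers are hypothetical.

```python
def last_hop_delay(chunk_seconds, video_bitrate_bps, avg_bandwidth_bps):
    """Time to push one chunk over the last hop (assumed form: bits / bandwidth)."""
    if avg_bandwidth_bps <= 0:
        raise ValueError("bandwidth must be positive")
    return chunk_seconds * video_bitrate_bps / avg_bandwidth_bps

def end_to_end_download_time(wait_s, service_s, chunk_seconds,
                             video_bitrate_bps, avg_bandwidth_bps):
    """Queue waiting time + service time + last-hop transfer time, in seconds."""
    return wait_s + service_s + last_hop_delay(
        chunk_seconds, video_bitrate_bps, avg_bandwidth_bps)

# Hypothetical numbers: 4 s chunks, 3 Mbps video, 30 Mbps last-hop bandwidth.
print(last_hop_delay(4.0, 3e6, 30e6))
print(end_to_end_download_time(1.5, 2.0, 4.0, 3e6, 30e6))
```

When the bandwidth is well above the video bit-rate, the extra term is a small fraction of the chunk length, which matches the remark that the last hop adds only a small additional delay.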
References
- [1] Marketsandmarkets, “Solution, by service, by platform, by user type, by deployment type, by revenue model, by industry, and by region - global forecast to 2021,” http://www.marketsandmarkets.com/Market-Reports/video-streaming-market-181135120.html, May 2016.
- [2] D. Mowrey, “Cloud video trends to watch in 2017,” http://www.multichannel.com/blog/mcn-guest-blog/cloud-video-trends-watch-2017/409903, Jan 2017.
- [3] H. Weatherspoon and J. Kubiatowicz, “Erasure coding vs. replication: A quantitative comparison,” in Revised Papers from the First International Workshop on Peer-to-Peer Systems, ser. IPTPS ’01. Springer-Verlag, 2002.
- [4] A. Dimakis, P. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran, “Network coding for distributed storage systems,” Information Theory, IEEE Transactions on, vol. 56, no. 9, pp. 4539–4551, Sept 2010.
- [5] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, “Xoring elephants: Novel erasure codes for big data,” in Proceedings of the 39th international conference on Very Large Data Bases, 2013.
- [6] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, “Erasure coding in windows azure storage,” in Proceedings of the 2012 USENIX Conference on Annual Technical Conference, ser. USENIX ATC’12. USENIX Association, 2012.
- [7] A. Fikes, “Storage architecture and challenges (talk at the google faculty summit),” http://bit.ly/nUylRW, Tech. Rep., 2010.
- [8] T.-Y. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson, “A buffer-based approach to rate adaptation: Evidence from a large video streaming service,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, pp. 187–198, 2015.
