Generalized piggybacking codes for distributed storage systems
Shuai Yuan, Qin Huang, Zulin Wang

TL;DR
This paper introduces a generalized piggybacking coding scheme for distributed storage that reduces repair bandwidth and computational complexity by optimizing the proportion of protected instances, outperforming existing designs.
Contribution
It extends piggybacking codes by analyzing protected instances, optimizing their proportion, and demonstrating improved repair efficiency and lower complexity.
Findings
Proportion of protected instances affects repair bandwidth.
Optimized codes approach zero repair ratio as parity nodes increase.
Lower computational complexity than existing piggybacking codes.
Abstract
This paper generalizes the piggybacking constructions for distributed storage systems by considering various protected instances and piggybacked instances. Analysis demonstrates that the proportion of protected instances determines the average repair bandwidth for a systematic node. By optimizing the proportion of protected instances, the repair ratio of generalized piggybacking codes approaches zero instead of 50% as the number of parity check nodes tends to infinity. Furthermore, the computational complexity for repairing a single systematic node cost by generalized piggybacking codes is less than that of the existing piggybacking designs.
Table I: Repair ratios of RSR-II codes and generalized piggybacking codes

| RSR-II codes: stripes | RSR-II codes: repair ratio | Generalized piggybacking codes: stripes | Generalized piggybacking codes: repair ratio |
|---|---|---|---|
| 7 | 0.5886 | 2 | 0.6400 |
| 17 | 0.5341 | 3 | 0.4867 |
| 27 | 0.5207 | 4 | 0.4133 |
| 37 | 0.5147 | 5 | 0.3700 |
| 47 | 0.5114 | 5 | 0.3344 |
| 77 | 0.5068 | 6 | 0.2740 |
| 197 | 0.5026 | 10 | 0.1819 |
Table II: Computational complexity of MDS decoding and solving linear combinations

| | Multiplications | Additions |
|---|---|---|
| MDS decoding | | |
| Solving linear combinations | | |
Generalized Piggybacking Codes for Distributed Storage Systems
Shuai Yuan1, Qin Huang2,1,∗, Senior Member, IEEE, Zulin Wang2, Member, IEEE
1Qian Xuesen Laboratory of Space Technology
China Academy of Space Technology, Beijing, China, 100094
2School of Electronic and Information Engineering
Beihang University, Beijing, China, 100191
Part of this paper has been accepted by the IEEE Global Communications Conference (IEEE Globecom 2016). ∗Corresponding author: Q. Huang.
Index Terms:
piggybacking, distributed storage systems, MDS, node repair
I Introduction
Distributed storage systems (DSSs) are increasingly employed by network applications. Data in a DSS is spread over multiple storage devices, and these devices are prone to failure because of malfunctions or maintenance. To keep the stored data reliable when nodes become unavailable, a DSS must introduce redundancy. Replication is the simplest form of redundancy and has been adopted by many DSSs, such as the Google File System [1] and the Hadoop Distributed File System (HDFS) [2]. With the rapid growth of the amount of stored data, erasure coding has become a better choice for DSSs. Compared with replication, it can provide orders of magnitude higher reliability for the same storage consumption [3]. As a result, several large-scale systems, such as OceanStore [4], Total Recall [5], Windows Azure Storage [6], and Google Colossus (GFS2) [7], have employed erasure coding techniques to improve their storage efficiency.
Maximum distance separable (MDS) codes, one kind of erasure code, have been introduced into many DSSs for their optimal storage efficiency. The MDS property can be used to recover missing data in a DSS. Consider an $n$-node DSS deployed with an $(n, k)$ MDS code. If one node of this storage system fails, the data stored in $k$ nodes is required to reconstruct the missing data of the failed node. That is, $k$ times the amount of the missing data must be read and transferred to recover it. Thus, the usage of network and disk is significantly high, i.e., the repair efficiency is very low. To address this repair issue, many codes have been constructed to reduce the amount of data transmitted when repairing a failed node.
As stated in [8], there are three types of node repair: exact repair, functional repair, and exact repair of the systematic part. Among them, exact repair is the form most commonly considered in practical DSSs. In [9], Dimakis et al. defined the amount of data transmitted while repairing a single failed node as the repair bandwidth. The authors derived an optimal tradeoff between storage and repair bandwidth (the theoretic cut-set bound), and proposed regenerating codes, which lie on the tradeoff curve. The existence and construction of regenerating codes have been studied in [10, 11, 12, 13, 14]. However, the optimal tradeoff provided by regenerating codes was only derived for functional repair. Almost all the interior points on the storage-bandwidth tradeoff are not achievable under exact repair [15].
MDS array codes are another important class of erasure codes used in DSSs. They have the advantage of simple encoding and decoding procedures, so they can be easily implemented in hardware devices. Many designs of MDS array codes, such as EVENODD [16], B-code [17], X-code [18], RDP [19], STAR [20] and Zigzag codes [21], have been presented for storage and communication applications. However, the repair bandwidth of MDS array codes cannot achieve the theoretic cut-set bound.
In 2011, Rashmi et al. proposed a new kind of distributed storage code, called piggybacking codes, to reduce the amount of data read and downloaded for node repair [22]. The key idea of piggybacking codes is to take several instances of an existing base code and attach linear combinations of symbols in some protected instances to other non-protected instances. Hence, the missing symbols in protected instances can be recovered by solving these linear equations instead of MDS decoding. Piggybacking is a simple and useful construction to improve the repair efficiency of missing nodes. Several designs of piggybacking codes were presented in [22] and [23]. These designs reduce the average repair bandwidth for one failed node. Facebook Warehouse Cluster and the new Hadoop Distributed File System (HDFS) have employed piggybacking codes to improve their repair efficiency [24].
Although piggybacking codes are practical and easy to implement, the repair bandwidth reduction of the proposed piggybacking designs still leaves a gap to the theoretic cut-set bound of regenerating codes. In [23], Rashmi, Shah, and Ramchandran gave three specific piggybacking constructions. The second one, which we denote RSR-II, is the most efficient construction in terms of repair bandwidth. The description in [23] shows that RSR-II codes are able to save up to 50% of repair bandwidth. This paper investigates the mechanism by which piggybacking codes reduce repair bandwidth. According to the recovery methods of the systematic symbols, we divide the instances of piggybacking codes into protected stripes and non-protected stripes. An analysis of a lower bound on the repair bandwidth of RSR-II codes implies that the proportion of protected instances determines the repair efficiency of piggybacking constructions.
This paper first presents a generalized piggybacking design with various numbers of protected and non-protected stripes in order to obtain various proportions of protected stripes. Second, a lower bound and an upper bound on the repair bandwidth of generalized piggybacking codes are introduced. The analysis of the two bounds indicates that, by optimizing the proportion of protected stripes, the repair ratio (defined as the average repair bandwidth as a fraction of the amount of original messages) of a generalized piggybacking code approaches zero instead of 50% as the number of parity check nodes tends to infinity. It is closer to that of minimum storage regenerating (MSR) codes, which attain the theoretical lower bound. Finally, the computational complexity for the repair of a single failed systematic node is analyzed. The results show that generalized piggybacking codes are able to provide more efficient repair with little complexity overhead.
The remainder of this paper is organized as follows. Section II briefly introduces the piggybacking framework and RSR-II codes. Section III performs an analysis of the repair efficiency of RSR-II codes. Our generalized piggybacking codes are presented in Section IV. Finally, the conclusion is given in Section V.
II Background
II-A Maximum distance separable codes
Consider an $(n, k, d)$ linear block code $\mathcal{C}$, where $n$ is its code length, $k$ is its dimension, and $d$ represents the minimum Hamming distance. Code $\mathcal{C}$ is called an MDS code if its minimum Hamming distance meets the Singleton bound, i.e.,
$$d = n - k + 1. \tag{1}$$
MDS codes are an important class of linear block codes. For given parameters $n$ and $k$, the minimum distance $d$ reaches the maximum possible value. Thus, MDS codes are able to correct as many as $n - k$ erasures for given $n$ and $k$.
MDS codes have been extensively applied in many DSSs. In an $n$-node storage system, the original message is first divided into $k$ information packets. Subsequently, the $k$ packets are encoded into $n$ packets and stored in the $n$ nodes, respectively. With the MDS property, messages from any $k$ out of the $n$ nodes can reconstruct the original message. Thus, the system is able to tolerate the failures of any $n - k$ storage nodes.
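The any-$k$-out-of-$n$ property can be illustrated with a minimal sketch. It assumes a Reed-Solomon-style systematic code over the prime field GF(257), built from polynomial interpolation; the field size, the function names, and the example message are choices made for this illustration and do not come from the paper.

```python
# Toy systematic (n, k) MDS code over GF(257), via polynomial interpolation.
P = 257  # field size; an assumption made for this illustration


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the (xi, yi) pairs, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(message, n):
    """Node x stores the polynomial value at x; nodes 0..k-1 hold the message."""
    pts = list(enumerate(message))
    return [lagrange_eval(pts, x) for x in range(n)]


def reconstruct(available, k):
    """Recover the k message symbols from any k (node_index, symbol) pairs."""
    pts = available[:k]
    return [lagrange_eval(pts, x) for x in range(k)]


message = [10, 20, 30, 40]            # k = 4 information symbols
stored = encode(message, n=8)         # n = 8 nodes, tolerates n - k = 4 failures
survivors = [(i, stored[i]) for i in (1, 4, 6, 7)]
recovered = reconstruct(survivors, k=4)
```

Any other choice of four surviving nodes works the same way, which is exactly why repairing even a single node with a plain MDS code requires reading $k$ symbols.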
II-B Piggybacking framework
In this subsection, we introduce the piggybacking framework, which is the basis for constructing piggybacking codes. The piggybacking framework guarantees that DSSs are able to employ piggybacking codes without extra storage cost. Moreover, the decoding properties of the error-correction codes adopted by the original DSSs, such as the minimum distance or the MDS property, are not destroyed by the piggybacking construction.
In general, the piggybacking framework operates on multiple instances of an existing base code and adds several designed functions of the data in some instances onto other instances. The base code of piggybacking framework can be arbitrary. In fact, it is a very attractive feature in practice. Under the piggybacking framework, the DSSs enjoy a repair bandwidth reduction with only small modification based on their existing error-correction codes.
Consider a linear block code represented by encoding functions . Suppose is the original message of . The encoded symbols are . For an -node system, using as the base code, the piggybacking framework, which has instances of , is illustrated in Fig.1.
As shown in Fig.1, the rows correspond to the storage nodes, the columns are called stripes, are independent original messages and are piggyback functions.
An important constraint is that the piggyback functions added on the $i$-th stripe can only be linear combinations of the original messages of stripes $1, \dots, i-1$. This principle guarantees that all the stripes of the piggybacking framework are decodable through a recursive process. In stripe 1, no piggyback functions are added, so its original message can be directly recovered by using the decoding procedure of the base code. For stripe 2, with the decoded message of stripe 1, it is easy to compute the added piggyback functions and subtract them from the stored symbols. Then, the message of stripe 2 is decodable. In a similar way, after the decoding procedures of stripes $1, \dots, i-1$ are finished, their messages are available to compute the piggyback functions added on stripe $i$. The base code of this stripe is obtained after subtracting these piggyback functions, so that the message of stripe $i$ can be recovered.
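The recursive decoding argument can be sketched for two stripes. The sketch assumes a single-parity-check base code over GF(257) and an arbitrary linear piggyback function; both are hypothetical stand-ins chosen for brevity, not the constructions used in the paper.

```python
P = 257  # toy prime field; an assumption for this illustration


def encode(msg):
    """(k+1, k) single-parity-check code: k systematic symbols plus their sum."""
    return list(msg) + [sum(msg) % P]


def decode(symbols, missing):
    """Recover the message when the symbol at index `missing` is erased (None)."""
    k = len(symbols) - 1
    if missing == k:                      # only the parity symbol is missing
        return symbols[:k]
    known = sum(s for i, s in enumerate(symbols[:k]) if i != missing)
    msg = symbols[:k]
    msg[missing] = (symbols[k] - known) % P
    return msg


def piggyback(a):
    """Hypothetical piggyback function: a linear combination of stripe 1's message."""
    return (a[0] + 2 * a[1]) % P


a, b = [5, 6, 7], [100, 200, 250]
stripe1 = encode(a)                       # stripe 1 carries no piggyback
stripe2 = encode(b)
stripe2[-1] = (stripe2[-1] + piggyback(a)) % P   # piggyback added on stripe 2's parity

# Node 0 fails: its symbol in each stripe is erased.
stripe1[0] = None
stripe2[0] = None

a_hat = decode(stripe1, missing=0)                    # decode stripe 1 first
stripe2[-1] = (stripe2[-1] - piggyback(a_hat)) % P    # strip the piggyback
b_hat = decode(stripe2, missing=0)                    # stripe 2 is a plain codeword again
```

Because the piggyback depends only on the earlier stripe, the erasure-correcting capability of the base code is untouched, and no extra storage is used.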
As stated above, the symbols stored in one node are independent. Sometimes, an invertible linear transformation is applied to them to simplify the computation. Such a transformation still retains the decoding properties of the piggybacking framework.
II-C RSR-II codes
Under the piggybacking framework described in Section II-B, Rashmi et al. have presented three designs of piggybacking codes for different considerations. The second design, RSR-II, is constructed for the purpose of high repair efficiency. As stated in [23], RSR-II codes can save up to 50% of the repair bandwidth of a systematic node.
For the sake of simple description, an MDS code in systematic form is chosen as the base code. Denote as the number of parity check nodes. RSR-II codes consist of instances of the base code. Represent the associated original messages as , where is a vector of length , and . Then, the stripes are shown in the following form:
[TABLE]
where are encoding vectors corresponding to the parity check symbols of the base code.
The piggyback functions of RSR-II codes are linear combinations of the systematic symbols of the first stripes, and they are added on the last parity check symbols of the last stripes. The construction of these piggyback functions is taken in three steps.
First, the systematic nodes are split into node sets as evenly as possible. Without loss of generality, we suppose is not a multiple of , and define three variables as follows,
[TABLE]
Hence, the first node sets are of size , and the remaining are of size .
Second, define two sets of vectors of length and with
[TABLE]
Then, introduce selection vectors to separate the tuples in each vector of into segments. And the selection vectors are defined as follows
[TABLE]
where ’s are diagonal matrices of size . On the diagonal of , only the positions corresponding to the systematic nodes in are “1”. Therefore,
[TABLE]
Finally, add the piggyback functions of and into the parity check symbols in the last nodes. Hence, node +, , has the following form as shown in Fig.2(a). An invertible linear transformation is introduced to reduce the complexity for node repair. Finally, symbols in node + are illustrated in Fig.2(b).
II-D Repair bandwidth of RSR-II codes
We use repair ratio to represent the measure of repair efficiency of a distributed storage code. Repair ratio is defined as the average amount of transfer data needed for repairing one failure node as a fraction of original messages. In this subsection, we recall the repair procedure of one systematic node by RSR-II codes. Then, the repair ratio of RSR-II is computed.
Consider an -node DSS deployed with an RSR-II code. For the sake of simple description, we represent the first stripes as protected stripes, whose systematic symbols are involved in the piggyback functions and defined as protected symbols. Meanwhile, the last stripes are represented as non-protected stripes, whose systematic symbols are named with non-protected symbols. If the -th systematic node fails, repair procedure of this node is to recover the missing protected symbols and the missing non-protected symbols . Assume node belongs to which is one of the node sets described in Section.II-C. The repair procedure is described in Algorithm 1.
From Algorithm 1, symbols are needed to be downloaded in step 1, and symbols are needed in step 2. In step 3, if the size of is , the number of downloaded symbols is . Otherwise, if the size is , symbols are downloaded. We denote the average repair bandwidth of one systematic node as . The number of systematic nodes in the node sets of size is , and the number of those systematic nodes in the node set of size is . Thus
[TABLE]
Thus, the repair ratio is
[TABLE]
III Efficiency Analysis for RSR-II Codes
In this section, a further analysis on the repair efficiency of RSR-II is performed.
Here, we introduce a notation stripe-repair ratio to measure the repair efficiency of one stripe
[TABLE]
Consider a piggybacking code with stripes. Assume the stripe-repair ratios of these stripes are . Denote the proportions of these stripes as . Thus, the repair ratio for systematic nodes of this piggybacking code has the following form,
[TABLE]
Recall the RSR-II codes described in Section II-D. The repair procedure deals with the missing protected and non-protected symbols in two different measures: MDS decoding is adopted for the recovery of non-protected symbols, and the amount of downloading for repairing one missing non-protected symbol is symbols. As regard to the missing protected symbols, solving linear combinations is employed, and the average bandwidth is or , which depends on the size of node set containing the failure node. Denote and as the stripe-repair ratios of protected and non-protected stripes, respectively. The amount of original message of one stripe equals to the symbols stored in the systematic nodes. Hence,
[TABLE]
Although only an approximate value of is given by Equation (10), it is obvious that , i.e., repair procedure for protected stripes requires less downloaded symbols compared with non-protected stripes. This is the mechanism in reduction of repair bandwidth by using piggybacking codes.
In the remainder of this section, we explore the critical factors influencing the repair efficiency through an analysis of . Represent the proportion of protected stripes with . Thus, the proportion of non-protected stripes is . Rewrite as the form of Equation (9). Then,
[TABLE]
where , and . The inequality of quadratic and arithmetic means tells that for nonnegative integers , they satisfy the following inequality.
[TABLE]
Thus,
[TABLE]
with equality if and only if , i.e., is a multiple of . In this case, is able to reach a lower bound , and
[TABLE]
According to Equation (15), approaches as the number of parity check nodes tends to infinite, i.e., RSR-II codes are able to save at most repair bandwidth. For a DSS whose parameters are given, in order to further improve the repair efficiency, the structure of piggybacking design is supposed to be modified. As the analysis above, the protected stripe-repair ratio is smaller than . It implies that the repair efficiency of piggybacking codes may be improved by increasing according to Equation (12). Actually, larger means more protected symbols involved in one piggyback function that leads to the reduction of . Therefore, it is possible to improve the repair efficiency of piggybacking codes by optimizing the proportion of protected stripes .
IV Generalized Piggybacking Codes
In this section, we present a generalized construction which contains various protected and non-protected stripes. An analysis is performed to clarify the relationship between repair ratio and the proportion of protected stripes . The results show that our proposed generalized piggybacking codes are able to provide more efficient node repair by optimizing . The repair ratio of the generalized piggybacking codes approaches zero when the number of the parity check nodes tends to infinity.
IV-A Code design
Similarly, choose an $(n, k)$ systematic MDS code as the base code of a generalized piggybacking code, where $r = n - k$ is the number of parity check nodes. Two parameters $s$ and $p$ are introduced to represent the numbers of protected and piggybacked stripes, respectively. Figure 3 depicts the $s + p$ instances of the base code.
According to the construction principle of the piggybacking framework, the piggyback functions added on the $i$-th stripe should only involve the original messages of stripes $1, \dots, i-1$. For simplicity of analysis, we add the piggyback functions only on the parity check symbols in non-protected stripes, and redefine the non-protected stripes as piggybacked stripes. As illustrated in Fig. 3, all symbols stored in the $s + p$ stripes are divided into 4 regions.
- Region A contains all the systematic symbols of the protected stripes.
- Region B contains all the systematic symbols and the first parity check symbol of the piggybacked stripes.
- Region C contains all the parity check symbols of the protected stripes.
- Region D contains the last $r - 1$ parity check symbols of the piggybacked stripes, where $r$ is the number of parity check nodes.
Once a systematic node fails, the repair procedure must regenerate the missing symbols in Regions A and B. Similar to RSR-II codes, the systematic symbols in Region B are self-sustaining: according to the MDS property, the missing symbols in one row of Region B can be recovered from the surviving symbols in the other rows. As for the systematic symbols in Region A, piggyback functions are constructed to protect them. These piggyback functions are embedded in Region D. With $s$ protected stripes, $p$ piggybacked stripes, and $r$ parity check nodes, the size of Region D is $p(r-1)$, i.e., at most $p(r-1)$ piggyback functions can be designed. Note that the $s$ failed protected symbols in one row of Region A must be recovered simultaneously by solving a set of linear combinations. In order to guarantee that there are enough piggyback functions to recover those missing symbols in Region A simultaneously, the following inequality must be satisfied when we choose the parameters $s$ and $p$:
$$s \le p(r-1). \tag{16}$$
In the remainder of this subsection, a method for constructing the piggyback functions is described, where $s$ and $p$ denote the numbers of protected and piggybacked stripes and $r$ the number of parity check nodes.
1. Construct an empty piggybacking array of size $\lceil ks/(p(r-1)) \rceil \times p(r-1)$. Each column of this piggybacking array corresponds to one piggyback function.
2. Fill the protected symbols in Region A into the piggybacking array. The protected symbols in Region A form a $k \times s$ array as shown in Fig. 3. Step 2 takes these symbols row by row from the $k \times s$ array and fills them into the piggybacking array. Obviously, if $ks$ is not divisible by $p(r-1)$, the last row of this piggybacking array will not be full.
3. Obtain the piggyback functions, and add them to Region D. After all protected symbols are allocated into the piggybacking array, sum up the symbols in each column. Thus, $p(r-1)$ piggyback functions are obtained, and they can be added into Region D in an arbitrary order.
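The three construction steps above can be sketched as a small index computation. The function name and the `(node, stripe)` index convention are hypothetical, and `num_functions` is taken to be the size of Region D, which is an interpretation of the text rather than notation confirmed by the source.

```python
def build_piggyback_functions(k, s, num_functions):
    """Return, for each piggyback function, the (node, stripe) indices of the
    protected symbols that are summed in it.

    k: number of systematic nodes (rows of Region A)
    s: number of protected stripes (columns of Region A)
    num_functions: number of columns of the piggybacking array
    """
    if s > num_functions:
        raise ValueError("one node's s protected symbols must land in distinct columns")
    # Step 2: read Region A row by row (node by node).
    symbols = [(node, stripe) for node in range(k) for stripe in range(s)]
    # Steps 1 and 3: the j-th symbol of the row-wise fill goes to column
    # j mod num_functions; each column is later summed into one function.
    columns = [[] for _ in range(num_functions)]
    for pos, sym in enumerate(symbols):
        columns[pos % num_functions].append(sym)
    return columns


# 4 systematic nodes, 3 protected stripes, 6 piggyback functions
functions = build_piggyback_functions(k=4, s=3, num_functions=6)
```

With these parameters every function sums exactly two protected symbols, and no function contains two symbols of the same node, so a single node failure leaves one unknown per function.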
It is remarkable that the piggyback functions are only summations of some protected symbols. As a result, the recovery of missing protected symbols could be very simple. An example is presented to illustrate the partition method and the repair procedure.
Example 1.
Consider an $(8, 4)$ systematic MDS code as the base code. Set $s = 3$ and $p = 2$. Let the 5 input message vectors each have length 4. Thus, the original storage array is
[TABLE]
The protected symbols in Region A are , , and . Fill them into a piggyback array. We have
[TABLE]
Sum the symbols in each column up, and then we achieve the six piggyback functions . Finally, the generalized piggybacking code can be constructed as follows
[TABLE]
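The construction and the repair of one systematic node can be traced end to end with a minimal sketch. It assumes four systematic nodes, four parity nodes, three protected stripes and two piggybacked stripes, which is consistent with the five length-4 messages and six piggyback functions of the example. The base code is a Reed-Solomon-style code over GF(257) and the message values are arbitrary; these choices, like all function names, are assumptions made for the illustration.

```python
P = 257  # toy prime field (assumption; the source does not fix a field)


def lagrange_eval(points, x):
    """Value at x of the polynomial through the (xi, yi) points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode_stripe(msg, n):
    """Systematic MDS stripe: node x stores the polynomial value at x."""
    pts = list(enumerate(msg))
    return [lagrange_eval(pts, x) for x in range(n)]


def build_code(protected, piggybacked, n):
    k, s, p = len(protected[0]), len(protected), len(piggybacked)
    r = n - k
    m = p * (r - 1)                       # number of piggyback functions
    assert s <= m, "need s <= p(r-1)"
    stripes = [encode_stripe(msg, n) for msg in protected + piggybacked]
    array = [[stripes[c][node] for c in range(s + p)] for node in range(n)]
    funcs = [[] for _ in range(m)]
    for pos in range(k * s):              # row-wise fill of Region A
        funcs[pos % m].append((pos // s, pos % s))    # (node, protected stripe)
    slots = []
    for idx, func in enumerate(funcs):
        node = k + 1 + idx % (r - 1)      # Region D skips the first parity node
        stripe = s + idx // (r - 1)       # piggybacked stripe hosting the function
        value = sum(protected[q][i] for i, q in func)
        array[node][stripe] = (array[node][stripe] + value) % P
        slots.append((node, stripe))
    return array, funcs, slots


def repair(array, funcs, slots, failed, k, s, p):
    """Repair systematic node `failed`; return (symbols, downloaded symbol count)."""
    out, downloads, decoded = [None] * (s + p), 0, {}
    for c in range(s, s + p):             # Region B: MDS decoding per piggybacked stripe
        pts = [(i, array[i][c]) for i in range(k + 1) if i != failed]
        downloads += len(pts)
        decoded[c] = [lagrange_eval(pts, x) for x in range(k)]
        out[c] = decoded[c][failed]
    for func, (node, stripe) in zip(funcs, slots):
        hit = [q for i, q in func if i == failed]
        if not hit:
            continue
        value = array[node][stripe]       # stored parity symbol plus piggyback
        downloads += 1
        parity = lagrange_eval(list(enumerate(decoded[stripe])), node)
        value = (value - parity) % P      # what remains is the piggyback function
        for i, q in func:
            if i != failed:
                value = (value - array[i][q]) % P     # surviving protected symbol
                downloads += 1
        out[hit[0]] = value
    return out, downloads


protected = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
piggybacked = [[13, 14, 15, 16], [17, 18, 19, 20]]
array, funcs, slots = build_code(protected, piggybacked, n=8)
repaired, downloads = repair(array, funcs, slots, failed=1, k=4, s=3, p=2)
```

In this run the failed node is rebuilt from 14 downloaded symbols, whereas MDS decoding of all five stripes would need 20.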
IV-B Analysis on repair bandwidth
Recall the construction of piggyback functions in Section.IV-A. If is not dividable by , the systematic symbols partitioned into the piggyback functions are uneven. Here, we define the sizes of these piggyback functions as the numbers of contained systematic symbols in Region A. Without loss of generality, assume the sizes are not all the same, and denote them as . Obviously, they satisfy that
[TABLE]
Suppose that the -th systematic node fails, . All remaining symbols stored in Region B except node are needed to reconstruct with the MDS property. The amount transmitted in this step is symbols. In Region D, the parity check symbols containing the piggyback functions of are required to recover the missing protected symbols. Moreover, the components along should be subtracted out from the downloaded parity check symbols. However, the left piggybacking functions are still involved with some other protected symbols besides . Hence, more symbols in Region A are needed. Assume the sizes of these piggybacking functions are . The download amount of systematic symbols from Region A in this step is .
Now we derive the total bandwidth of repairing all the systematic nodes. Symbols in Region B need to be downloaded times. Consider a parity check symbol stored in Region D. Suppose the size of the piggybacking function embedded in this parity check symbol is . During the repair procedures, the parity check symbol needs to be downloaded times. Meanwhile, each of the involved systematic symbols in Region A needs to be downloaded times. Therefore, the total repair bandwidth of all the systematic nodes is .
From the above, the average repair ratio is
[TABLE]
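The counting argument above can be checked numerically with a short sketch. It assumes that a piggybacked stripe costs $k$ downloads per failure and that a piggyback function of size $t$ contributes $t^2$ downloads over all failures, so the total is $k^2 p + \sum t_i^2$ for $s$ protected and $p$ piggybacked stripes. The symbol names and the parameter choices in the usage lines (base codes with as many parity nodes as systematic nodes, one piggybacked stripe) are inferred for this illustration; they are not stated in this copy of the paper.

```python
def function_sizes(k, r, s, p):
    """Sizes of the p(r-1) piggyback functions when k*s protected symbols
    are spread over them as evenly as possible."""
    m = p * (r - 1)
    base, extra = divmod(k * s, m)
    return [base + 1] * extra + [base] * (m - extra)


def repair_ratio(k, r, s, p):
    """Average repair bandwidth of a systematic node as a fraction of the
    original message size k*(s+p)."""
    if s > p * (r - 1):
        raise ValueError("need s <= p(r-1)")
    sizes = function_sizes(k, r, s, p)
    # Region B: k symbols per piggybacked stripe for each of the k failures.
    # Region A/D: a function of size t is downloaded t times, and each of its
    # t protected symbols t-1 times, giving t*t downloads in total.
    total = k * k * p + sum(t * t for t in sizes)
    return total / (k * k * (s + p))


ratios = [
    repair_ratio(k=5, r=5, s=1, p=1),
    repair_ratio(k=10, r=10, s=2, p=1),
    repair_ratio(k=15, r=15, s=3, p=1),
    repair_ratio(k=100, r=100, s=9, p=1),
]
```

Under these inferred parameters the values round to 0.6400, 0.4867, 0.4133 and 0.1819, which agree with the generalized piggybacking entries of Table I for 2, 3, 4 and 10 stripes.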
Rewrite Equation (18) as
[TABLE]
Without loss of generality, assume is not dividable by , and
[TABLE]
Thus, out of piggyback functions have the size of , and the rest ones have the size of . Then, goes to
[TABLE]
In a DSS, the parameters of base code are given. Thus, is varied with different values of . In order to explore the relationship between and the proportion of protected instances , the lower and upper bounds of are derived as follows,
[TABLE]
Rewrite the lower and upper bounds as functions and of . Then,
[TABLE]
Example 2.
Assume the code rate of the base code is , i.e., . For various ’s, Figure 4 shows the curves of and with .
It illustrates that the lower bound and upper bound are close to each other. Moreover, both of them can reach their extreme points by optimizing which implies that the generalized piggybacking code can obtain optimum with appropriate parameters .
Further analyze the optimum condition for with the derivatives of and which are with respect to and listed as follows
[TABLE]
Let and equal to zero. Then, we work out the minimum values of and as follows,
, when ; 2. 2.
, when .
The results indicate that
1. the minimum of the lower bound is only determined by the number of parity check nodes $r$;
2. the minimum of the upper bound is determined by both $r$ and $k$; however, for high code rates, it is dominantly determined by $r$;
3. the minimum of the upper bound corresponds closely to the minimum of the lower bound. In other words, there exists a generalized piggybacking code whose repair ratio is very close to the lower bound.
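The dependence of the optimum on the number of parity check nodes alone can be made explicit with a short derivation. It is a sketch under stated assumptions: $s$ and $p$ denote the numbers of protected and piggybacked stripes, $t_i$ the size of the $i$-th piggyback function, $\alpha = s/(s+p)$ the proportion of protected stripes, and the lower bound is taken to be the value reached when all piggyback functions have equal size. The exact expressions in the paper may be written differently.

```latex
% Assumed notation: s protected stripes, p piggybacked stripes, r parity nodes,
% t_i = size of the i-th piggyback function, \alpha = s/(s+p).
\begin{align*}
\gamma &= \frac{k^2 p + \sum_{i=1}^{p(r-1)} t_i^2}{k^2 (s+p)},
  \qquad \sum_{i=1}^{p(r-1)} t_i = ks, \\
\sum_{i=1}^{p(r-1)} t_i^2 &\ge \frac{(ks)^2}{p(r-1)}
  \;\Longrightarrow\;
  \gamma \ge (1-\alpha) + \frac{\alpha^2}{(r-1)(1-\alpha)}, \\
\frac{d}{d\alpha}\left[(1-\alpha) + \frac{\alpha^2}{(r-1)(1-\alpha)}\right] = 0
  &\iff (1-\alpha)^2 = \frac{1}{r}
  \iff \alpha = 1 - \frac{1}{\sqrt{r}}, \\
\text{minimum value} &= \frac{2}{\sqrt{r}+1} \xrightarrow[r \to \infty]{} 0 .
\end{align*}
```

For $r = 100$ this gives $2/11 \approx 0.1818$, just below the value 0.1819 listed for 10 stripes in Table I.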
Figure 5 shows the curves of and with .
It implies that
[TABLE]
At the end of this subsection, we perform asymptotic analyses of and , and compare them with the repair ratio of minimum storage regenerating (MSR) codes . The limits of and as approaches infinity are
[TABLE]
As described in [9, 8, 25], MSR codes which correspond to the best storage efficiency are one of two most important classes of regenerating codes. The repair bandwidth for one failure node is
$$\frac{M d}{k(d - k + 1)},$$
where represents the size of original messages, denotes the number of accessed surviving nodes, and is the dimension of the MSR code. For the sake of simple comparison, we set the code rate to , and such that the MSR code provides the highest repair efficiency. Thus,
[TABLE]
The curves of , and are shown in Fig.6.
It shows that the repair ratio of generalized piggybacking codes approaches zero instead of 50% as the number of parity check nodes tends to infinity. As a result, compared with RSR-II codes, generalized piggybacking codes are able to provide more efficient node repair with less bandwidth. Moreover, it is closer to the repair ratio of MSR codes, which is the theoretical lower bound of the repair ratio.
Table I compares the repair efficiency of RSR-II codes and generalized piggybacking codes for various code parameters. It shows that, as the number of parity check nodes increases, generalized piggybacking codes reach a smaller repair bandwidth.
IV-C Analysis on decoding complexity
In this subsection, the complexity of node repair procedure of generalized piggybacking codes is analyzed first. Then the comparison with RSR-II codes is performed. It is shown that the computational complexity for repairing a single systematic node cost by generalized piggybacking codes is much less than that of RSR-II codes.
As the statement in Section.III and IV-B, piggybacking codes adopt two kinds of calculations to repair a failed node. MDS decoding is used for the recovery of the missing symbols in non-protected or piggybacked stripes, while solving linear combinations is employed to reconstruct the missing symbols in protected stripes. Recall the generalized piggybacking code, in Section.IV-A, which has protected stripes and piggybacked stripes. The repair procedure of the -th systematic node is described in Section.IV-B.
In order to recover - the missing symbol of the -th piggybacked stripe, the symbols are required. Denote the vector representation of () as . Then, can be worked out by the below equation.
[TABLE]
Hence, the MDS decoding for the recovery of one missing symbol in a piggybacked stripe costs $k$ multiplications and $k - 1$ additions.
Consider the recovery of - the missing symbol in the -th protected stripe. According to the description of Section.IV-A, we denote the piggyback function which involves together with other protected symbols as . In order to reconstruct , from Region D, the stored symbol containing is needed, and the surviving protected symbols are also required. Hence, can be figured out as follows.
- Compute the parity check symbol contained in the stored symbol. This step costs $k$ multiplications and $k - 1$ additions.
- Subtract the parity check symbol from the stored symbol. Thus, 1 addition is needed.
- Subtract the surviving protected symbols from the remaining piggyback function. Thus, one addition per surviving protected symbol is required.
Actually, represents the size of the piggyback function , i.e., equals to or . Therefore, solving linear combinations for one missing protected symbol costs multiplications and additions, on average.
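The operation counts of the two recovery methods can be tallied with a small sketch. It assumes that both the MDS decoding of one symbol and the recomputation of one parity check symbol are $k$-term linear combinations, and that a piggyback function of size `size` requires `size - 1` subtractions of surviving protected symbols; the function names are hypothetical.

```python
def mds_decoding_ops(k):
    """Recover one symbol as a linear combination of k downloaded symbols."""
    return {"mul": k, "add": k - 1}


def linear_combination_ops(k, size):
    """Recover one protected symbol from a piggyback function of `size` symbols."""
    adds = (k - 1)        # recompute the parity check symbol (k-term combination)
    adds += 1             # subtract it from the stored symbol
    adds += size - 1      # subtract the surviving protected symbols
    return {"mul": k, "add": adds}


mds = mds_decoding_ops(k=4)
lin = linear_combination_ops(k=4, size=2)
```

Both methods use the same number of multiplications in this model; solving a linear combination only adds `size` further additions.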
The computational complexity of MDS decoding and solving linear combinations is listed in Table II.
According to the analysis in Section III, solving linear combinations is introduced by piggybacking codes to reduce the repair bandwidth of part of the missing symbols. For RSR-II codes, the missing protected symbols need to be recovered simultaneously by solving a group of linear functions. As a result, we have to perform Gaussian elimination. However, for generalized piggybacking codes, the piggyback functions are simple summations of some protected symbols. Compared with the calculations for MDS decoding, those for solving linear combinations only require some extra additions. Thus, the generalized piggybacking framework is able to provide high repair efficiency, because it can significantly reduce the repair bandwidth for a single failed systematic node with low computational complexity.
V Conclusion and Discussion
This paper presents a generalized piggybacking construction with various protected instances and piggybacked instances. Compared with the previous designs, the proposed generalized piggybacking codes can save more repair bandwidth by optimizing the proportion of protected instances. When the number of parity check nodes tends to infinity, the average repair bandwidth as a fraction of the total messages approaches zero. Moreover, complexity analysis demonstrates that generalized piggybacking codes are able to efficiently repair a failed node with reasonable complexity overhead.
In fact, if we view piggyback functions from the perspective of error-correction codes, piggybacking codes are a meeting point between codes with small minimum Hamming distance and codes with large minimum Hamming distance. The repair of systematic symbols in piggybacked stripes relies on the base codes of these stripes. These base codes have strong erasure-correction capability due to their large minimum distance. However, this results in strong correlation among all the symbols. Thus, decoding these ‘good codes’ requires a large amount of data access. For the repair of protected stripes, piggyback functions are linear combinations of the protected systematic symbols. In other words, these symbols together with the piggyback functions can be considered as linear codes with small minimum distance. Since these ‘bad codes’ have weak correlation among symbols, their decoding requires only a small amount of data access.
VI Acknowledgment
We sincerely thank Prof. Shu Lin and Dr. Zhiying Wang for their constructive suggestions. This work was funded by NSAF under Grant U1530117 and the National Natural Science Foundation of China under Grant 61471022, and was also sponsored by the Laboratory Independent Innovation project of Qian Xuesen Laboratory of Space Technology.
References
- [1] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in Proc. ACM SIGOPS Operating Systems Review, vol. 37, no. 5, 2003, pp. 29–43.
- [2] D. Borthakur, “HDFS architecture guide,” 2008. [Online]. Available: http://hadoop.apache.org/common/docs/current/hdfs_design.pdf
- [3] H. Weatherspoon and J. D. Kubiatowicz, “Erasure coding vs. replication: A quantitative comparison,” in Proc. Peer-to-Peer Systems (IPTPS), 2002, pp. 328–337.
- [4] S. C. Rhea, P. R. Eaton, D. Geels, H. Weatherspoon, B. Y. Zhao, and J. Kubiatowicz, “Pond: The OceanStore prototype,” in Proc. 2nd USENIX Conf. File and Storage Technologies (FAST), 2003, pp. 1–14.
- [5] R. Bhagwan, K. Tati, Y. Cheng, S. Savage, and G. M. Voelker, “Total Recall: System support for automated availability management,” in Proc. 1st Conf. Networked Systems Design and Implementation (NSDI), 2004, pp. 25–25.
- [6] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci et al., “Windows Azure Storage: A highly available cloud storage service with strong consistency,” in Proc. 23rd ACM Symposium on Operating Systems Principles, 2011, pp. 143–157.
- [7] “Google-GFS2 Colossus,” 2012. [Online]. Available: http://www.quora.com/Colossus-Google-GFS2.
- [8] A. G. Dimakis, K. Ramchandran, Y. Wu, and C. Suh, “A survey on network codes for distributed storage,” Proceedings of the IEEE, vol. 99, no. 3, pp. 476–489, 2011.
