Convertible Codes: Efficient Conversion of Coded Data in Distributed Storage
Francisco Maturana, K. V. Rashmi

TL;DR
This paper introduces a new class of codes called convertible codes that enable resource-efficient conversion of encoded data in distributed storage, reducing overhead compared to traditional re-encoding methods.
Contribution
The authors formalize code conversion, define convertible codes, and provide optimal constructions with tight bounds on resource usage in the merge regime.
Findings
Achieved tight bounds on node accesses during code conversion.
Constructed explicit MDS convertible codes optimal in the merge regime.
Provided low-field-size constructions for a broad parameter range.
Abstract
Large-scale distributed storage systems typically use erasure codes to provide durability of data in the face of failures. A set of $k$ blocks to be stored is encoded using an $[n, k]$ code to generate $n$ blocks that are then stored on different storage nodes. The redundancy configuration is chosen based on the failure rates of storage devices, and is typically kept constant. However, recent work by Kadekodi et al. shows that the failure rate of storage devices varies significantly over time, and that adapting the redundancy configuration in response to such variations provides significant benefits. Converting the redundancy configuration of already encoded data by re-encoding requires significant overhead on resources such as accesses, device IO, network bandwidth, and compute cycles. In this work, we first present a framework to formalize the notion of code conversion: the process…
Taxonomy
Topics: Advanced Data Storage Technologies · Caching and Content Delivery · Distributed systems and fault tolerance
Convertible Codes: Efficient Conversion of Coded Data in Distributed Storage
Francisco Maturana and K. V. Rashmi
Computer Science Department
Carnegie Mellon University
{fmaturan, rvinayak}@cs.cmu.edu
Abstract
Large-scale distributed storage systems typically use erasure codes to provide durability of data in the face of failures. A set of $k$ blocks to be stored is encoded using an $[n, k]$ code to generate $n$ blocks that are then stored on different storage nodes. The redundancy configuration (that is, the parameters $n$ and $k$) is chosen based on the failure rates of storage devices, and is typically kept constant. However, recent work by Kadekodi et al. shows that the failure rate of storage devices varies significantly over time, and that adapting the redundancy configuration in response to such variations provides significant benefits: an 11% to 44% reduction in storage space requirement, which translates to enormous savings in resources and energy in large-scale storage systems. However, converting the redundancy configuration of already encoded data by simply re-encoding (the default approach) requires significant overhead on system resources such as accesses, device IO, network bandwidth, and compute cycles.
In this work, we first present a framework to formalize the notion of code conversion: the process of converting data encoded with an $[n^I, k^I]$ code into data encoded with an $[n^F, k^F]$ code while maintaining desired decodability properties, such as the maximum-distance-separable (MDS) property. We then introduce convertible codes, a new class of codes that allow for code conversions in a resource-efficient manner. For an important parameter regime (which we call the merge regime), along with the widely used linearity and MDS decodability constraints, we prove tight bounds on the number of nodes accessed during code conversion. In particular, our achievability result is an explicit construction of MDS convertible codes that are optimal for all parameter values in the merge regime, albeit with a high field size. We then present explicit low-field-size constructions of optimal MDS convertible codes for a broad range of parameters in the merge regime. Our results thus show that it is indeed possible to achieve code conversions with significantly fewer resources than the default approach of re-encoding.
I Introduction
Large-scale distributed storage systems form the bedrock of modern data processing systems. Such storage systems comprise hundreds of thousands of storage devices and routinely face failures in their day-to-day operation [1, 2, 3, 4]. In order to provide resiliency against such failures, storage systems employ redundancy, typically in the form of erasure codes [5, 6, 7, 8]. Under erasure coding, a set of $k$ data blocks to be stored is encoded using an $[n, k]$ code to generate $n$ coded blocks. A set of $n$ encoded blocks that correspond to the same $k$ original data blocks is called a "stripe". Each of the $n$ coded blocks in a stripe is stored on a different storage node (typically chosen from different failure domains). The amount of redundancy added using an erasure code is a function of the redundancy configuration, that is, the parameters $n$ and $k$. These parameters are chosen so as to achieve predetermined thresholds on reliability and availability, such as the mean-time-to-data-loss (MTTDL).
The key factor that determines MTTDL for chosen parameters is the failure rate of the storage devices. In a recent work [9], Kadekodi et al. show that failure rates of storage devices in large-scale storage systems vary significantly over time (for example, by more than 3.5-fold for certain disk families). Thus, it is advantageous to change the redundancy configuration in response to such variations. Kadekodi et al. [9] present a case for tailoring erasure code parameters to the observed failure rates and show that an 11% to 44% reduction in storage space can be achieved by adapting the redundancy configuration according to the changing failure rates. Such a reduction in storage space requirement translates to significant savings in the cost of resources and energy consumed in large-scale storage systems.
In particular, disk failure rates exhibit a bathtub curve during the lifetime of disks, which is characterized by three phases: infancy, useful life, and wearout, in that order [9]. Disk failure rate during infancy and wearout can be multiple times higher than during useful life. As a consequence, the chosen redundancy setting will likely be too high for some periods, which is a waste of resources, and too low for other periods, which increases the risk of data loss. Kadekodi et al. [9] address this problem by changing the code rate (that is, the parameters of the erasure coding scheme) as the devices go through different phases of life. For example, given a group of nodes with certain failure characteristics, the system may use a code during infancy, then convert to a code during useful life, and finally convert back to a code during wearout. We refer the reader to [9] for an in-depth study on failure rate variations and the advantages of adapting the erasure-code parameters with these variations.
Adapting the redundancy configuration requires modifying the code rate for all the stripes that have at least one block stored on a certain disk group when the failure rate of that disk group changes by more than a threshold amount [9]. Changing the code rate (that is, the parameters of the erasure code) employed on already encoded data can be highly resource intensive, potentially requiring access to multiple storage devices, reading large amounts of data, transferring it over the network, and re-encoding it. Modifying the code parameters using the default approach requires reading at least $k$ blocks from each stripe, transferring them over the network, and re-encoding. In large-scale storage systems, disks are deployed in large batches, and hence a large number of disks go through failure-rate transitions concurrently. Thus, adapting the redundancy configuration by using the default approach of re-encoding generates highly varying and prohibitively large load spikes, which adversely affect the foreground traffic. This places a significant burden on precious cluster resources such as accesses, disk IO, network bandwidth, and computation cycles (CPU). Furthermore, in some cases these conversions need to be performed urgently, such as when there is an unexpected rise in failure rates and conversion is necessary to reduce the risk of data loss. In such cases, it is necessary to be able to perform fast conversions. Motivated by these applications, in this paper, we initiate a formal study of such code conversions by exploring the following questions:
- What are the fundamental limits on resource consumption of code conversions?
- How can one design codes that efficiently facilitate code conversions?
Formally, the goal is to convert data that is already encoded using an $[n^I, k^I]$ code (denoted by $\mathcal{C}^I$) into data encoded using an $[n^F, k^F]$ code (denoted by $\mathcal{C}^F$), with desired constraints on decodability such as both initial and final codes satisfying the maximum-distance-separable (MDS) property. (The superscripts $I$ and $F$ stand for initial and final respectively, representing the initial and final state of the conversion.) Clearly, it is always possible to read the original data (and decode if needed) and re-encode according to $\mathcal{C}^F$. However, such a re-encoding approach requires accessing several nodes ($k^I$ nodes per stripe for MDS codes), reading out all the data, transferring it over the network, and re-encoding, which consumes large amounts of access, disk IO, network bandwidth, and CPU resources.
The question then is whether one can perform such conversions in a more resource-efficient manner, while satisfying the decodability constraints. We now present an example showing how resource-efficient conversion can be achieved in a simple manner for certain parameters.
Example 1.
Consider $n^I = 3$, $k^I = 2$ and $n^F = 5$, $k^F = 4$, with the requirement that both $\mathcal{C}^I$ and $\mathcal{C}^F$ are MDS. This conversion can be achieved by "merging" two stripes of the initial code into one stripe, for each stripe of the final code. Let us focus on the number of blocks accessed during conversion. Using the default approach of re-encoding to achieve the conversion requires accessing $k^I = 2$ blocks from each of two stripes of encoded data under $\mathcal{C}^I$ (initial stripes) to create one stripe of encoded data under $\mathcal{C}^F$ (final stripe). That is, each stripe of encoded data under the final code requires accessing $4$ blocks. Alternatively, as depicted in Figure 1, one can choose $\mathcal{C}^I$ and $\mathcal{C}^F$ to be systematic, single-parity-check codes, with the parity block holding the XOR of the data blocks in each stripe (shown with a shaded box in the figure). To convert from $\mathcal{C}^I$ to $\mathcal{C}^F$, one can compute the XOR of the single parities of the two initial stripes, and store the result as the parity block for the stripe under $\mathcal{C}^F$. This alternative approach requires accessing only two blocks for each final stripe, and thus is significantly more efficient in the number of accessed blocks than the default approach.
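The merge described in Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: blocks are modeled as small integers so that XOR is the `^` operator, and the helper names `encode_single_parity` and `merge_by_parity_xor` are hypothetical, not taken from the paper.

```python
from functools import reduce
from operator import xor


def encode_single_parity(data_blocks):
    """Systematic single-parity-check encoding: data blocks followed by their XOR."""
    return list(data_blocks) + [reduce(xor, data_blocks)]


def merge_by_parity_xor(stripe_a, stripe_b):
    """Merge two single-parity stripes while reading only their parity blocks."""
    # The two parity blocks are the only blocks read during conversion.
    new_parity = stripe_a[-1] ^ stripe_b[-1]
    # Data blocks stay where they are; they are listed only to show the final stripe.
    return stripe_a[:-1] + stripe_b[:-1] + [new_parity]


# Two initial stripes with two data blocks and one parity block each.
initial_a = encode_single_parity([0x0F, 0x33])
initial_b = encode_single_parity([0x55, 0xA0])

# One final stripe with four data blocks and one parity block.
final_stripe = merge_by_parity_xor(initial_a, initial_b)

# Re-encoding all four data blocks from scratch yields the same stripe.
reencoded = encode_single_parity([0x0F, 0x33, 0x55, 0xA0])
```

The merged stripe coincides with the re-encoded one because XOR is associative and commutative, so the XOR of the two initial parities equals the XOR of all four data blocks.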
In this paper, we first propose a novel framework that formalizes the concept of code conversion, that is, the process of converting data encoded with an $[n^I, k^I]$ code into data encoded with an $[n^F, k^F]$ code while maintaining desired decodability properties, such as the maximum-distance-separable (MDS) property. We then introduce a new class of code pairs, which we call convertible codes, that allow for resource-efficient conversions. We begin the study of this new class of code pairs by focusing on an important regime where $k^F = \varsigma k^I$ for any integer $\varsigma \geq 2$, with arbitrary values of $n^I$ and $n^F$, which we call the merge regime. Furthermore, we focus on the access cost of code conversion, which corresponds to the total number of nodes that participate in the conversion. Keeping the number of nodes accessed small makes conversion less disruptive and allows the unaffected nodes to remain available for serving client requests. In addition, reducing the number of accesses also reduces the disk IO, network bandwidth, and CPU consumed.
We prove tight bounds on the access cost of conversions for linear MDS codes in the merge regime. In particular, our achievability result is an explicit construction of MDS convertible codes that are access-optimal for all parameter values in the merge regime, albeit with a high field size. Finally, we present a sequence of practical low-field-size constructions of access-optimal MDS convertible codes in the merge regime based on Hankel arrays. These constructions lead to a tradeoff between field size and the range of parameter values they cover, with two extreme points that differ in the parameter range covered and the field size required. Thus, our results show that code conversions can be achieved with significantly less resource overhead than the default approach of re-encoding. Furthermore, all the constructions presented have the added benefit that they continue to be optimal for a wide range of parameters, which makes it possible to handle the case where the parameters of the final code are unknown a priori.
The rest of the paper is organized as follows. Section II discusses related work. Section III formalizes the notion of code conversions and presents a framework for studying convertible codes. Section IV shows the derivation of lower bounds on the access cost of conversions for linear MDS codes in the merge regime. Section V describes a general explicit construction for MDS codes in the merge regime that meets the access cost lower bounds, albeit with a high field size. Section VI describes low-field-size constructions for MDS codes in the merge regime, which provide a tradeoff between field size and the range of parameter values they cover. Finally, Section VII presents our conclusions and discusses future directions.
II Related work
There is extensive literature on the use of erasure codes for reliable data storage. In storage systems, failures can be effectively modeled as erasures, and thereby, erasure codes can be used to provide tolerance to failures, at the cost of some storage overhead [10, 11]. Maximum distance separable (MDS) codes are often used for this purpose, since they achieve the optimal tradeoff between failure tolerance and storage overhead. A well-known and often-used family of MDS codes is Reed-Solomon codes [12].
When using erasure codes in storage systems, a host of other overheads and performance metrics, in addition to storage overhead, come into the picture. Encoding/decoding complexity, node repair performance, degraded read performance, field size, and other metrics can significantly affect real system performance. Several works in the literature have studied these aspects.
The encoding and decoding of data, and the finite field arithmetic that they require, can be compute intensive. Motivated by this, array codes [13, 14, 15, 16] are designed to use XOR operations exclusively, which are typically faster to execute, and aim to decrease the complexity of encoding and decoding.
The repair of failed nodes can incur a large amount of data read and transfer, burdening device IO and network bandwidth. Several approaches have been proposed to alleviate the impact of repair operations. Dimakis et al. [17] proposed a new class of codes called regenerating codes that minimize the amount of network bandwidth consumed during repair operations. Several explicit constructions of regenerating codes have been proposed (for example, see [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]) as well as generalizations (for example, see [29, 30, 31]). It has been shown that meeting the lower bound on the repair bandwidth requirement when the MDS property and high rate are desired necessitates a large value for the so-called "sub-packetization" [32, 33, 34, 35], which negatively affects certain key performance metrics in storage systems [3]. To overcome this issue, several works [36, 37, 38] have proposed code constructions that relax the requirement of meeting lower bounds on IO and bandwidth requirements for repair operations. For example, the Piggybacking framework [37] provides a general framework to construct repair-efficient codes by transforming any existing codes, while allowing a small sub-packetization (even as small as $2$). The works discussed above construct vector codes in order to improve the efficiency of the repair operation. The papers [39, 40, 41] propose repair algorithms for (scalar) Reed-Solomon codes that reduce the network bandwidth consumed during repair by downloading elements from a subfield rather than the finite field over which the code is constructed. Network bandwidth consumed is another metric to optimize for during conversion. In this paper, we only focus on the access cost.
Another class of codes, called local codes [42, 43, 44, 45, 46, 47, 48, 49, 50, 51], focuses on the locality of codeword symbols during repair, that is, the number of nodes that need to be accessed when repairing a single failure. Local codes improve repair and degraded read performance, since missing information can be recovered without having to recover the full data. The locality metric for repair that local codes optimize for is similar to the access cost metric for conversion that we optimize for in this work, as both these metrics aim to minimize the number of nodes accessed.
There are several classical techniques for creating new codes from existing ones [12]. For example, techniques such as puncturing, extending, and shortening can be used to modify codes. These techniques, however, do not consider the cost of performing such modifications to data that is already encoded, which is the focus of our work.
Several works [52, 53] study the problem of two-stage encoding: first generating a certain number of parities during the encoding process and then adding additional parities. As discussed in [52], adding additional parities can be conceptually viewed as a repair process by considering the new parity nodes to be generated as failed nodes. Furthermore, as shown in [19], for MDS codes, the bandwidth requirement for repair of even a single node is lower bounded by the same amount as in regenerating codes that require repair of all nodes. Thus one can always employ a regenerating code to add additional parities with minimum bandwidth overhead. However, when the MDS property and high rate are desired, as discussed above, using regenerating codes requires a large sub-packetization. The paper [53] employs the Piggybacking framework [36, 37] to construct codes that overcome the issue of a large sub-packetization factor. The scenario of adding a fixed number of additional parities, when viewed under the setting of conversions, corresponds to having $k^F = k^I$ and $n^F > n^I$.
Another related work [54] proposes a storage system that uses two erasure codes. One of the codes prioritizes the network bandwidth required for recovery, while the other prioritizes storage overhead, and data is converted between the two codes according to the workload. This application constitutes another motivation for resource-efficient conversions. To reduce the cost of code conversion, the system [54] uses product codes [12] and locally repairable codes [44], and the local parities are leveraged during conversion. The authors, however, choose codes from these two families ad hoc, and do not focus on the problem of designing these codes to minimize the cost of code conversion.
Several works [55, 56, 57] study the update operation in erasure coded storage systems, and the problem of maintaining consistency in such mutable storage systems. The cost of updates is another metric to optimize for in convertible codes, which we do not consider in this paper. In the current paper, the focus is on immutable storage systems, which comprise a vast majority of large-scale storage systems.
III A framework for studying code conversions
In this section, we formally define and study code conversions and introduce convertible codes.
Suppose one wants to convert data that is already encoded using an $[n^I, k^I]$ initial code $\mathcal{C}^I$ into data encoded using an $[n^F, k^F]$ final code $\mathcal{C}^F$. Assume, without loss of generality, that each node has a fixed storage capacity. In the initial and final configurations, the system stores the same information, but encoded differently. In order to capture the changes in the dimension of the code during conversion, we consider $M = \mathrm{lcm}(k^I, k^F)$ "message" symbols (i.e., the data to be stored) over a finite field $\mathbb{F}_q$, denoted by $\mathbf{m} \in \mathbb{F}_q^M$. This corresponds to multiple stripes in the initial and final configurations. We note that this need to consider multiple stripes in order to capture the smallest instance of the problem deviates from the existing literature on the repair problem in distributed storage codes, where a single stripe is sufficient to capture the problem.
Since there are multiple stripes, we first specify an initial partition $\mathcal{P}_I$ and a final partition $\mathcal{P}_F$ of the set $[M] = \{1, \ldots, M\}$, which map the message symbols of $\mathbf{m}$ to their corresponding initial and final stripes. The initial partition $\mathcal{P}_I$ is composed of $M / k^I$ disjoint subsets of size $k^I$, and the final partition $\mathcal{P}_F$ is composed of $M / k^F$ disjoint subsets of size $k^F$. In the initial (respectively, final) configuration, the data indexed by each subset $P \in \mathcal{P}_I$ (respectively, $P \in \mathcal{P}_F$) is encoded using the code $\mathcal{C}^I$ (respectively, $\mathcal{C}^F$). The codewords $\{\mathcal{C}^I(\mathbf{m}_P) : P \in \mathcal{P}_I\}$ are referred to as initial stripes, and the codewords $\{\mathcal{C}^F(\mathbf{m}_P) : P \in \mathcal{P}_F\}$ are referred to as final stripes, where $\mathbf{m}_P$ corresponds to the projection of $\mathbf{m}$ onto the coordinates in $P$ and $\mathcal{C}(\mathbf{m}_P)$ is the encoding of $\mathbf{m}_P$ under code $\mathcal{C}$. We now formally define code conversion and convertible codes.
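Definition 1 (Code conversion).
A conversion from an initial code $\mathcal{C}^I$ to a final code $\mathcal{C}^F$ with initial partition $\mathcal{P}_I$ and final partition $\mathcal{P}_F$ is a procedure, denoted by $T_{\mathcal{C}^I \to \mathcal{C}^F}$, that for any $\mathbf{m}$, takes the set of initial stripes $\{\mathcal{C}^I(\mathbf{m}_P) : P \in \mathcal{P}_I\}$ as input, and outputs the corresponding set of final stripes $\{\mathcal{C}^F(\mathbf{m}_P) : P \in \mathcal{P}_F\}$.
The descriptions of the initial and final partitions and codes, along with the conversion procedure, define a convertible code.
Definition 2 (Convertible code).
An $(n^I, k^I; n^F, k^F)$ convertible code over $\mathbb{F}_q$ is defined by: (1) a pair of codes $(\mathcal{C}^I, \mathcal{C}^F)$ where $\mathcal{C}^I$ is an $[n^I, k^I]$ code over $\mathbb{F}_q$ and $\mathcal{C}^F$ is an $[n^F, k^F]$ code over $\mathbb{F}_q$; (2) a pair of partitions $\mathcal{P}_I, \mathcal{P}_F$ of $[M]$, with $M = \mathrm{lcm}(k^I, k^F)$, such that each subset in $\mathcal{P}_I$ is of size $k^I$ and each subset in $\mathcal{P}_F$ is of size $k^F$; and (3) a conversion procedure $T_{\mathcal{C}^I \to \mathcal{C}^F}$ that on input $\{\mathcal{C}^I(\mathbf{m}_P) : P \in \mathcal{P}_I\}$ outputs $\{\mathcal{C}^F(\mathbf{m}_P) : P \in \mathcal{P}_F\}$ for all $\mathbf{m} \in \mathbb{F}_q^M$.
In addition, constraints on the distance (i.e., decodability) of the codes $\mathcal{C}^I$ and $\mathcal{C}^F$ would typically be imposed, such as requiring both codes to be MDS.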
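Example 2.
Suppose we want to transition from a $[3, 2]$ code $\mathcal{C}^I$ to a $[5, 3]$ code $\mathcal{C}^F$. We consider data $\mathbf{m}$ of length $M = \mathrm{lcm}(2, 3) = 6$. In the initial configuration, the data is partitioned into three stripes, each one composed of three blocks encoding two message symbols. For example, if $\mathcal{P}_I = \{\{1, 2\}, \{3, 4\}, \{5, 6\}\}$ then the initial stripes are $\mathcal{C}^I(m_1, m_2)$, $\mathcal{C}^I(m_3, m_4)$, and $\mathcal{C}^I(m_5, m_6)$. In the final configuration, the data is partitioned into two stripes, each one composed of five blocks encoding three message symbols. For example, if $\mathcal{P}_F = \{\{1, 2, 3\}, \{4, 5, 6\}\}$ then the final stripes are $\mathcal{C}^F(m_1, m_2, m_3)$ and $\mathcal{C}^F(m_4, m_5, m_6)$. Note that a different valid final partition could have been $\mathcal{P}_F = \{\{1, 3, 5\}, \{2, 4, 6\}\}$.
The conversion procedure $T_{\mathcal{C}^I \to \mathcal{C}^F}$ must take the initial stripes as input, and output the final stripes. In this example, the codes $(\mathcal{C}^I, \mathcal{C}^F)$, the partitions $(\mathcal{P}_I, \mathcal{P}_F)$, and the procedure $T_{\mathcal{C}^I \to \mathcal{C}^F}$ define a convertible code.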
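As a rough illustration of the bookkeeping in Example 2 (three initial stripes of two message symbols each, two final stripes of three message symbols each), the sketch below computes the number of message symbols and one possible pair of partitions. The helper `contiguous_partition` and the choice of contiguous index ranges are assumptions made for illustration; any other split into subsets of the right size would be equally valid.

```python
from math import gcd


def contiguous_partition(num_symbols, subset_size):
    """Split the indices 1..num_symbols into consecutive subsets of equal size."""
    return [
        list(range(start, start + subset_size))
        for start in range(1, num_symbols + 1, subset_size)
    ]


k_initial, k_final = 2, 3

# Smallest number of message symbols that fills whole initial and final stripes:
# the least common multiple of the two code dimensions.
num_symbols = k_initial * k_final // gcd(k_initial, k_final)

initial_partition = contiguous_partition(num_symbols, k_initial)
final_partition = contiguous_partition(num_symbols, k_final)
```

With these parameters the data spans three initial stripes and two final stripes, which is why a single stripe is not enough to describe the conversion.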
Remark 1.
Note that the definition of convertible codes (Definition 2) assumes that $(n^I, k^I; n^F, k^F)$ are fixed a priori, and are known at code construction time. This will be helpful in understanding the fundamental limits of the conversion process. In practice, this assumption might not always hold. For example, the parameters $n^F, k^F$ depend on the node failure rates that are yet to be observed. Interestingly, it is indeed possible for a convertible code to facilitate conversion for multiple values of $n^F, k^F$, as is the case for the code constructions presented in this paper.
The overhead of conversion in a convertible code is determined by the cost of the conversion procedure $T_{\mathcal{C}^I \to \mathcal{C}^F}$, as a function of the parameters $(n^I, k^I; n^F, k^F)$. Towards minimizing the overhead of the conversion, our general objective is to design codes $(\mathcal{C}^I, \mathcal{C}^F)$, partitions $(\mathcal{P}_I, \mathcal{P}_F)$ and a conversion procedure $T_{\mathcal{C}^I \to \mathcal{C}^F}$ that satisfy Definition 2 and minimize the conversion cost for given parameters $(n^I, k^I; n^F, k^F)$, subject to desired decodability constraints on $\mathcal{C}^I$ and $\mathcal{C}^F$.
Depending on the relative importance of various resources in the cluster, one might be interested in optimizing the conversion with respect to various types of costs such as access, network bandwidth, disk IO, CPU, etc., or a combination of these costs. The general formulation of code conversions above provides a powerful framework to theoretically reason about convertible codes. In what follows, we will focus on a specific regime and a specific cost model.
IV Lower bounds on access cost of code conversion
The focus of this section is on deriving lower bounds on the access cost of code conversion. We consider one of the fundamental regimes of convertible codes, which corresponds to merging several initial stripes of a code into a single, longer final stripe. Specifically, the convertible codes in this regime have $k^F = \varsigma k^I$, where the integer $\varsigma \geq 2$ is the number of initial stripes merged, with arbitrary values of $n^I$ and $n^F$. We call this regime the merge regime. We additionally require that both the initial and final codes are linear and MDS. Since linear MDS codes are widely used in storage systems and are well understood in the coding theory literature, they constitute a good starting point.
We focus on the access cost of code conversion, that is, the total number of blocks accessed during conversion. Each new block needs to be written, and hence requires accessing a node. Similarly, each block from the initial stripes that is read requires accessing a node. Therefore, minimizing access cost amounts to minimizing the sum of the number of new blocks written and the number of blocks read from the initial stripes. (Readers who are familiar with the literature on regenerating codes might observe that convertible codes optimizing for the access cost are "scalar" codes as opposed to being "vector" codes.) Keeping this number small makes code conversion less disruptive and allows the unaffected nodes to remain available for application-specific purposes throughout the procedure, for example, to serve client requests in a storage system. Furthermore, reducing the number of accesses also reduces the disk IO, network bandwidth, and CPU consumed.
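To make the cost model concrete, the following sketch counts accesses for the single-parity merge of Figure 1 (two initial stripes with two data blocks each, one new parity block) under two strategies. It assumes, for illustration only, that the default re-encoding approach leaves the data blocks in place and writes just the new parity block; the function names are hypothetical.

```python
def access_cost(blocks_read, new_blocks_written):
    """Access cost: blocks read from initial stripes plus new blocks written."""
    return blocks_read + new_blocks_written


def reencoding_access_cost(k_initial, num_stripes, new_blocks):
    """Default approach: read k_initial blocks from every initial stripe."""
    return access_cost(num_stripes * k_initial, new_blocks)


def parity_merge_access_cost(num_stripes, new_blocks=1):
    """Single-parity merge: read only the parity block of every initial stripe."""
    return access_cost(num_stripes, new_blocks)


# Figure 1 setting: two initial stripes, two data blocks each, one new parity.
default_cost = reencoding_access_cost(k_initial=2, num_stripes=2, new_blocks=1)
merge_cost = parity_merge_access_cost(num_stripes=2)
```

Under these assumptions the default approach touches five blocks while the parity merge touches three, and the gap grows with the number of data blocks per initial stripe.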
In Section V, we will show that the lower bounds on the access cost derived in this section are in fact achievable. Therefore, we refer to MDS convertible codes in the merge regime that achieve these lower bounds as access-optimal.
Definition 3 (Access-optimal).
A linear MDS $(n^I, k^I; n^F, k^F = \varsigma k^I)$ convertible code is said to be access-optimal if and only if it attains the minimum access cost over all linear MDS $(n^I, k^I; n^F, k^F = \varsigma k^I)$ convertible codes.
We first describe the notation in Section IV-A and then derive lower bounds on the access cost in Section IV-B.
IV-A Notation
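Let $\mathcal{C}^I$ be an $[n^I, k^I]$ MDS code over field $\mathbb{F}_q$, specified by generator matrix $G^I$, with columns (that is, encoding vectors) $\{\mathbf{g}^I_1, \ldots, \mathbf{g}^I_{n^I}\} \subseteq \mathbb{F}_q^{k^I}$. Let $\varsigma \geq 2$ be an integer, and let $\mathcal{C}^F$ be an $[n^F, k^F = \varsigma k^I]$ MDS code over field $\mathbb{F}_q$, specified by generator matrix $G^F$, with columns (that is, encoding vectors) $\{\mathbf{g}^F_1, \ldots, \mathbf{g}^F_{n^F}\} \subseteq \mathbb{F}_q^{k^F}$. Let $r^I = n^I - k^I$ and $r^F = n^F - k^F$. When $\mathcal{C}^I$ and $\mathcal{C}^F$ are systematic, $r^I$ and $r^F$ correspond to the initial number of parities and final number of parities, respectively. All vectors are assumed to be column vectors. We will use the notation $\mathbf{v}[l]$ to denote the $l$-th coordinate of a vector $\mathbf{v}$.
We will represent all the code symbols in the initial stripes as being generated by a single $\varsigma k^I \times \varsigma n^I$ matrix $\tilde{G}^I$, with encoding vectors $\{\tilde{\mathbf{g}}^I_{i,j} : i \in [\varsigma], j \in [n^I]\} \subseteq \mathbb{F}_q^{\varsigma k^I}$. This representation can be viewed as embedding the column vectors of the generator matrix $G^I$ in a $\varsigma k^I$-dimensional space, where the index set $K_i = \{(i-1)k^I + 1, \ldots, i k^I\}$, $i \in [\varsigma]$, indexes the coordinates used by the encoding vectors of initial stripe $i$. Let $\tilde{\mathbf{g}}^I_{i,j}$ denote the $j$-th encoding vector in the initial stripe $i$ in this (embedded) representation. Thus, $\tilde{\mathbf{g}}^I_{i,j}[l] = \mathbf{g}^I_j[l - (i-1)k^I]$ for $l \in K_i$, and $\tilde{\mathbf{g}}^I_{i,j}[l] = 0$ otherwise. As an example, Figure 2 shows the values of the defined terms for the single parity-check code from Figure 1 with $n^I = 3$, $k^I = 2$, $n^F = 5$, $k^F = 4$.
At times, the focus will be only on the coordinates of an encoding vector of a certain initial stripe $i$. For this purpose, define $\mathrm{proj}_K(\mathbf{v})$ to be the projection of $\mathbf{v}$ to the coordinates in an index set $K$, and for a set $V$ of vectors, $\mathrm{proj}_K(V) = \{\mathrm{proj}_K(\mathbf{v}) : \mathbf{v} \in V\}$. For example, $\mathrm{proj}_{K_i}(\tilde{\mathbf{g}}^I_{i,j}) = \mathbf{g}^I_j$ for all $i \in [\varsigma]$ and $j \in [n^I]$.
The following sets of vectors are defined: the encoding vectors from initial stripe $i$, $\mathcal{S}^I_i = \{\tilde{\mathbf{g}}^I_{i,j} : j \in [n^I]\}$; all the encoding vectors from all the initial stripes, $\mathcal{S}^I = \bigcup_{i \in [\varsigma]} \mathcal{S}^I_i$; and all the encoding vectors from the final stripe, $\mathcal{S}^F = \{\mathbf{g}^F_j : j \in [n^F]\}$.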
We use the term unchanged blocks to refer to blocks from the initial stripes that remain as is (that is, unchanged) in the final stripe. The blocks in the final stripe that were not present in the initial stripes are called new, and the blocks from the initial stripes that do not carry over to the final stripe are called retired. For example, in Figure 1, all the data blocks are unchanged blocks (unshaded boxes), the single parity block of the final stripe is a new block, and the two parity blocks from the initial stripes are retired blocks. Each unchanged block corresponds to a pair of identical initial and final encoding vectors, that is, a tuple of indices $(i, j, l)$ such that $\tilde{\mathbf{g}}^I_{i,j} = \mathbf{g}^F_l$. For instance, the example in Figure 1 has four unchanged blocks, corresponding to the identical encoding vectors $\tilde{\mathbf{g}}^I_{i,j} = \mathbf{g}^F_{2(i-1)+j}$ for $i, j \in \{1, 2\}$. The final encoding vectors $\mathcal{S}^F$ can thus be partitioned into the following sets: unchanged encoding vectors from initial stripe $i$, $\mathcal{U}_i = \mathcal{S}^F \cap \mathcal{S}^I_i$ for all $i \in [\varsigma]$, and new encoding vectors $\mathcal{N} = \mathcal{S}^F \setminus \mathcal{S}^I$.
From the point of view of conversion cost, unchanged blocks are ideal, because they require no extra work. On the other hand, constructing new blocks requires accessing blocks from the initial stripes. When a block from the initial stripes is accessed, all of its contents are downloaded to a central location, where they are available for the construction of all new blocks. For example, in Figure 1, one block from each initial stripe is accessed during conversion.
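During conversion, new blocks are constructed by reading blocks from the initial stripes. That is, every new encoding vector is simply a linear combination of a specific subset of $\mathcal{S}^I$. Define the read access set for an MDS $(n^I, k^I; n^F, k^F = \varsigma k^I)$ convertible code as the set of tuples $\mathcal{D} \subseteq [\varsigma] \times [n^I]$ such that the set of new encoding vectors $\mathcal{N}$ is contained in the span of the set $\{\tilde{\mathbf{g}}^I_{i,j} : (i, j) \in \mathcal{D}\}$. Furthermore, define the index sets $\mathcal{D}_i = \{j : (i, j) \in \mathcal{D}\}$ for all $i \in [\varsigma]$, which denote the encoding vectors accessed from each initial stripe.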
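The embedded representation and the read access set can be illustrated on the single parity-check example of Figure 1, working over GF(2). The sketch below is a minimal illustration under stated assumptions: vectors are Python lists of bits, the helpers `embed`, `gf2_rank`, and `in_span` are hypothetical names, and the final parity is taken to be the all-ones encoding vector (the XOR of the four data blocks).

```python
def embed(vector, stripe, num_stripes):
    """Place a length-k encoding vector in the coordinates of its (1-indexed) stripe."""
    k = len(vector)
    embedded = [0] * (k * num_stripes)
    embedded[(stripe - 1) * k : stripe * k] = vector
    return embedded


def gf2_rank(vectors):
    """Rank over GF(2), using Gaussian elimination on integer bitmasks."""
    pivots = {}  # leading-bit position -> reduced row with that leading bit
    for vector in vectors:
        row = int("".join(map(str, vector)), 2)
        while row:
            lead = row.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = row
                break
            row ^= pivots[lead]
    return len(pivots)


def in_span(target, vectors):
    """True if target is a GF(2) linear combination of the given vectors."""
    return gf2_rank(vectors) == gf2_rank(vectors + [target])


# Columns of the generator matrix of the [3, 2] single parity-check code.
initial_columns = [[1, 0], [0, 1], [1, 1]]
num_stripes = 2

# Embedded encoding vectors, indexed by (stripe, position), both 1-indexed.
embedded = {
    (i, j): embed(initial_columns[j - 1], i, num_stripes)
    for i in range(1, num_stripes + 1)
    for j in range(1, len(initial_columns) + 1)
}

new_vector = [1, 1, 1, 1]             # parity of the merged [5, 4] stripe
read_access_set = [(1, 3), (2, 3)]    # the parity block of each initial stripe
accessed = [embedded[index] for index in read_access_set]

parities_suffice = in_span(new_vector, accessed)
one_parity_suffices = in_span(new_vector, [embedded[(1, 3)]])
```

Reading the two initial parities is enough to span the new encoding vector, whereas reading a single parity is not, which matches the observation that one block from each initial stripe is accessed in Figure 1.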
IV-B Lower bounds on the access cost of code conversion
In this subsection, we present lower bounds on the access cost of linear MDS convertible codes in the merge regime. This is done in four steps:
1. We show that in the merge regime, all possible pairs of initial partitions $\mathcal{P}_I$ and final partitions $\mathcal{P}_F$ are equivalent up to relabeling, and hence do not need to be specified.
2. An upper bound on the maximum number of unchanged blocks is proved. We call convertible codes that meet this bound "stable".
3. Lower bounds on the access cost of linear MDS convertible codes are proved, under the added restriction that the convertible codes are stable.
4. The stability restriction is removed, by showing that non-stable linear MDS convertible codes necessarily incur higher access cost, and hence it suffices to consider only stable MDS convertible codes.
We now start with the first step. In the general regime, partition functions need to be specified since they indicate how message symbols from the initial stripes are mapped into the final stripes. In the merge regime, however, there is only one final stripe, and hence the choice of the partition functions does not matter.
Proposition 1.
For every $(n^I, k^I; n^F, k^F = \varsigma k^I)$ convertible code, all possible pairs of initial and final partitions $(\mathcal{P}_I, \mathcal{P}_F)$ are equivalent up to relabeling.
Proof.
Given that $k^F = \varsigma k^I$, there is only one possible final partition $\mathcal{P}_F = \{[\varsigma k^I]\}$. Thus, regardless of $\mathcal{P}_I$, all data in the initial stripes will get mapped to the same final stripe. By relabeling blocks, any two initial partitions can be made equivalent. ∎
Thus, the analysis of convertible codes in the merge regime can be simplified by noting that the choice of partitions $\mathcal{P}_I$ and $\mathcal{P}_F$ is inconsequential.
Since one of the terms in access cost is the number of new blocks, a natural way to reduce access cost is to maximize the number of unchanged blocks. However, there is a limit on the number of blocks that can remain unchanged.
Proposition 2**.**
In an MDS convertible code, there can be at most unchanged vectors from each initial stripe. Thus, there can be at most unchanged vectors in total, or in other words, there will be at least new vectors.
Proof.
Every subset of size at least is linearly dependent, and thus if then cannot be MDS. Hence, for each stripe , the amount of unchanged vectors is at most . ā
Since new blocksĀ are constructed using only the contents of blocksĀ read, it is clear that both the quantities that make up access costĀ are going to be related. Intuitively, more new blocksĀ means that more blocksĀ need to be read, resulting in higher access cost. With this intuition in mind, we will first focus on the case where the number of new blocksĀ is the minimum: . We refer to such codes as stable convertible codes.
Definition 4** (Stability).**
An MDS convertible codeĀ is stable if and only if it has exactly unchanged blocks, or in other words, exactly new blocks.
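The counting behind Proposition 2 and Definition 4 can be sketched as follows. The symbols were lost in this copy of the text, so the parameter names are assumptions: `k_initial` is the dimension of each initial code, `num_stripes` is the number of initial stripes being merged (so the final dimension is `num_stripes * k_initial`), and `n_final` is the length of the final code. The helper functions are hypothetical and only illustrate the arithmetic.

```python
def max_unchanged_blocks(k_initial: int, num_stripes: int) -> int:
    """At most k_initial unchanged blocks per initial stripe (assumed reading of Proposition 2)."""
    return num_stripes * k_initial


def min_new_blocks(n_final: int, k_initial: int, num_stripes: int) -> int:
    """Blocks of the final stripe that cannot be unchanged blocks."""
    k_final = num_stripes * k_initial  # merge regime: one final stripe holds all the data
    if n_final < k_final:
        raise ValueError("final code length must be at least its dimension")
    return n_final - k_final


def is_stable(num_unchanged: int, k_initial: int, num_stripes: int) -> bool:
    """Stable means the number of unchanged blocks meets the upper bound exactly."""
    return num_unchanged == max_unchanged_blocks(k_initial, num_stripes)
```

For instance, merging two stripes of dimension 10 into a final code of length 24 leaves at most 20 blocks unchanged and forces at least 4 new blocks.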
We first prove lower bounds on the access cost of stable linear MDS convertible codes, and then show that the access cost of conversion in MDS codes without this stability property can only be higher.
A natural question now is characterizing the minimum size of the read access set for conversion for MDS codes. Clearly, accessing blocks from each initial stripe will always suffice, since this is sufficient to decode all the original data. Thus, in a minimum size we can upper bound the size of each by .
The first lower bound on the size of will be given by the interaction between and the MDS property.
Lemma 3.
For all linear stable MDS convertible codes, the read access set from each initial stripe satisfies .
Proof.
By the MDS property, every subset of size at most is linearly independent. For any initial stripe , consider the set of all unchanged encoding vectors from other stripes, , and pick any subset of new encoding vectors of size . Consider the subset : it is true that and . Therefore, all the encoding vectors in are linearly independent.
Notice that the encoding vectors in contain no information about initial stripe and complete information about every other initial stripe . Therefore, the information about initial stripe in each encoding vector in has to be linearly independent since, otherwise, could not be linearly independent. Formally, it must be the case that has rank equal to (recall from Section IV-A that is the set of coordinates belonging to initial stripe ). However, by definition, the subset must be contained in the span of . Therefore, the rank of is at least that of , which implies that . ∎
Therefore, in general we need to access at least vectors from each initial stripe, unless , in which case we need to access encoding vectors, that is, the full data.
We next show that, in a linear MDS stable convertible code in the merge regime, when the number of new blocks is bigger than , at least blocks need to be accessed from each initial stripe. The intuition behind this result is the following: in an MDS stable convertible code in the merge regime, when the number of new blocks is bigger than , during a conversion one is forced to read more than blocks. Hence there must exist blocks from the initial stripes that are both unchanged and are read during conversion. Since the unchanged blocks that are read are also present in the final stripe, the information read from these blocks is not useful in creating a new block that retains the MDS property for the final code unless blocks (that is, full data) are read.
Lemma 4.
For all linear stable MDS convertible codes, if then the read access set from each initial stripe satisfies .
Proof.
When , this lemma is equivalent to Lemma 3, so assume . From the proof of Lemma 3, for every initial stripe it holds that . Since , this implies that must contain at least one index of an unchanged encoding vector.
Choose a subset of at most encoding vectors from , which must be linearly independent by the MDS property. In this subset, include all the unchanged encoding vectors from the other initial stripes, . Then, choose all the unchanged encoding vectors from initial stripe that are accessed during conversion, . For the remaining vectors (if any), choose an arbitrary subset of new encoding vectors, , such that:
[TABLE]
It is easy to check that the subset is of size at most , and therefore it is linearly independent. This choice of follows from the idea that the information contributed by to the new encoding vectors is already present in the unchanged encoding vectors, which will be at odds with the linear independence of .
Since the elements of and are the only encoding vectors in that contain information from initial stripe , it must be the case that has rank . Moreover, is contained in the span of by definition, so it holds that:
[TABLE]
From Equation 1, there are two cases:
Case 1: . Then and by Equation 2 it holds that .
Case 2: . Then and by Equation 2 it holds that:
[TABLE]
Notice that there are only retired (i.e., not unchanged) encoding vectors in stripe . Since every accessed encoding vector is either in or is a retired encoding vector, it holds that:
[TABLE]
By combining Equation 3 and Equation 4, we arrive at the contradiction , which occurs because there are not enough retired blocks in the initial stripe to ensure that the final code has the MDS property. Therefore, case 1 always holds, and . ∎
Combining the above results leads to the following theorem on the lower bound of read access set size of linear stable MDS convertible codes.
Theorem 5.
Let denote the minimum integer such that there exists a linear stable MDS convertible code with read access set of size . For all valid parameters, . Furthermore, if , then .
Proof.
Follows directly from Lemma 3 and Lemma 4. ∎
So far we have focused on deriving lower bounds on the access cost of conversion for stable MDS convertible codes, which have the maximum number of unchanged blocks. That is, convertible codes that have unchanged blocks and new blocks. We next show that this lower bound generally applies even for non-stable convertible codes by proving that increasing the number of new blocks from the minimum possible does not decrease the lower bound on the size of the read access set .
Lemma 6.
The lower bounds on the size of the read access set from Theorem 5 hold for all (including non-stable) linear MDS convertible codes.
Proof.
We show that, even for non-stable convertible codes, that is, when there are more than new blocks, the bounds on the read access set from Theorem 5 still hold.
Case 1: . Let be an arbitrary initial stripe. We lower bound the size of by invoking the MDS property on a subset of size that minimizes the size of the intersection . There are exactly encoding vectors in , so the minimum size of the intersection is . Clearly, the subset has rank due to the MDS property. Therefore, it holds that . By reordering, the following is obtained:
[TABLE]
which means that the bound on established in Lemma 3 continues to hold for non-stable codes.
Case 2: . Let be an arbitrary initial stripe, let be the unchanged encoding vectors that are accessed during conversion, and let be the unchanged encoding vectors that are not accessed during conversion. Consider the subset of encoding vectors from the final stripe such that and the size of the intersection is minimized. Since may exclude at most encoding vectors from the final stripe, it holds that:
[TABLE]
By the MDS property, is a linearly independent set of encoding vectors of size , and thus, must contain all the information to recover the contents of every initial stripe, and in particular, initial stripe . Since all the information in about stripe is in either or the accessed encoding vectors, it must hold that:
[TABLE]
From Equation 5, there are two cases:
Subcase 2.1: . Then , and by Equation 6 it holds that , which matches the bound of Lemma 4.
Subcase 2.2: . Then , and by Equation 6 it holds that:
[TABLE]
The initial stripe has blocks. By the principle of inclusion-exclusion we have that:
[TABLE]
By using Equation 7, Equation 8 and the fact that , we conclude that , which is a contradiction and means that subcase 2.1 always holds in this case. ∎
The above result, along with the fact that the lower bound in Theorem 5 is achievable (as will be shown in Section V), implies that all access-optimal linear MDS convertible codes in the merge regime have the minimum possible number of new blocks (which is , as shown in Proposition 2); that is, they are stable.
Lemma 7.
All access-optimal MDS convertible codes are stable.
Proof.
Lemma 6 shows that the lower bound on the read access set for stable linear MDS convertible codes continues to hold in the non-stable case. Furthermore, this bound is achievable by stable linear MDS convertible codes in the merge regime (as will be shown in Section V). The number of new blocks written during conversion under stable MDS convertible codes is . On the other hand, the number of new blocks under a non-stable convertible code is strictly greater than . Thus, the overall access cost of a non-stable MDS convertible code is strictly greater than the access cost of an access-optimal convertible code. ∎
Thus, for MDS convertible codes in the merge regime, it suffices to focus only on stable codes. Combining all the results above leads to the following key result.
Theorem 8.
For all linear MDS convertible codes, the access cost of conversion is at least . Furthermore, if , the access cost of conversion is at least .
Proof.
Follows from Theorem 5, Lemma 6, and the definition of access cost. ∎
In Section V we show that the lower bound of Theorem 8 is achievable for all parameters. Thus, Theorem 8 implies that it is possible to perform conversion of MDS convertible codes in the merge regime with significantly less access cost than the naïve strategy if and only if . For example, for an MDS convertible code the naïve strategy has an access cost of , while the optimal access cost is , which corresponds to savings in access cost of .
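The comparison between the naïve strategy and the lower bound can be sketched numerically. The formulas in Theorem 8 did not survive in this copy of the text, so the expressions below are a reading reconstructed from the prose of Lemma 3, Lemma 4, and the definition of access cost (blocks read plus new blocks written); treat them as assumptions. The names `k_initial`, `r_initial`, `r_final`, and `num_stripes` are hypothetical, and the parameters in the usage note are illustrative rather than the paper's own example.

```python
def naive_access_cost(k_initial: int, r_final: int, num_stripes: int) -> int:
    """Re-encoding: read every data block, then write every new parity block."""
    return num_stripes * k_initial + r_final


def access_cost_lower_bound(k_initial: int, r_initial: int,
                            r_final: int, num_stripes: int) -> int:
    """Assumed form of the bound: new blocks written plus blocks read per stripe."""
    if r_initial < r_final:
        # Assumed reading of Lemma 4: the full data of each stripe must be read.
        reads_per_stripe = k_initial
    else:
        # Assumed reading of Lemma 3: min of initial dimension and new block count.
        reads_per_stripe = min(k_initial, r_final)
    return r_final + num_stripes * reads_per_stripe


def savings_fraction(k_initial: int, r_initial: int,
                     r_final: int, num_stripes: int) -> float:
    """Fraction of the naive access cost saved when the bound is met."""
    naive = naive_access_cost(k_initial, r_final, num_stripes)
    bound = access_cost_lower_bound(k_initial, r_initial, r_final, num_stripes)
    return 1 - bound / naive
```

With two stripes of dimension 10, four initial parities, and four final parities, the naïve cost is 24 and the bound is 12. Raising the number of final parities to five, above the four initial parities, makes the bound equal to the naïve cost of 25.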
V Achievability: Explicit access-optimal convertible codes in the merge regime
In this section, we present an explicit construction of access-optimal MDS convertible codes for all parameters in the merge regime. In Section V-A, we describe the construction of the generator matrices for the initial and final code. Then, in Section V-B, we prove that the code described by this construction has optimal access cost during code conversion.
V-A Explicit construction
Recall that, in the merge regime, , for any integer and arbitrary and . Also, recall that and . Notice that when , or , constructing an access-optimal convertible code is trivial. In those cases, one can simply access all the data blocks of the initial stripes, which meets the bound stated in Theorem 5. Thus, assume .
Let be the generator matrices of respectively. Our construction is systematic, that is, both and are systematic MDS codes. Thus are of the form and , where is a matrix and is a matrix. Therefore, to define the initial and final code, only and need to be specified. Let be a finite field of size , where is any prime and the degree depends on the convertible code parameters and will be specified later in this section. Let be a primitive element of .
Define entry of as , where ranges over . Entry of is defined in an identical fashion, as , where ranges over .
For example, for , the matrices and would be:
[TABLE]
Our explicit construction is stable (recall from Lemma 7 that all access-optimal MDS convertible codes in the merge regime are stable), that is, it has exactly unchanged encoding vectors. Given that our construction is also systematic, it follows that these unchanged encoding vectors correspond exactly to the systematic elements of .
V-B Proof of optimal access cost during conversion
Throughout this section, we use the following notation for submatrices: let be a matrix; the submatrix of defined by row indices and column indices is denoted by . For conciseness, we use to denote all row or column indices, e.g., denotes the submatrix composed of columns , and denotes the submatrix composed of rows .
We first recall an important fact about systematic MDS codes.
Proposition 9 ([12]).
Let be an code with generator matrix . Then is MDS if and only if is superregular, that is, every square submatrix of is nonsingular. ∎
(Footnote 3: This definition of superregularity is different from the definition introduced in [58], which is sometimes used in the context of convolutional codes.)
Thus, to be MDS, both and need to be superregular.
From the bound in Lemma 3, to be access-optimal during conversion when , the columns of (that is, the new encoding vectors) have to be such that they can be constructed by only accessing columns of (that is, the initial encoding vectors) during conversion. Thus, it suffices to show that the columns of can be constructed by accessing only columns of during conversion. To capture this property, we introduce the following definition.
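The superregularity condition of Proposition 9 can be checked by brute force for small matrices. The sketch below works over a prime field GF(p) for simplicity (an assumption; the construction in this section uses an extension field), and the helper names are hypothetical. It enumerates every square submatrix and tests its determinant, so it is only practical for small parameters.

```python
from itertools import combinations


def det_mod_p(matrix, p):
    """Determinant of a square matrix over GF(p), by Gaussian elimination."""
    a = [[x % p for x in row] for row in matrix]
    n = len(a)
    det = 1
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det = det * a[col][col] % p
        inv = pow(a[col][col], -1, p)  # modular inverse (Python 3.8+)
        for r in range(col + 1, n):
            factor = a[r][col] * inv % p
            for c in range(col, n):
                a[r][c] = (a[r][c] - factor * a[col][c]) % p
    return det % p


def is_superregular(matrix, p):
    """True if every square submatrix is nonsingular over GF(p)."""
    rows, cols = len(matrix), len(matrix[0])
    for size in range(1, min(rows, cols) + 1):
        for rs in combinations(range(rows), size):
            for cs in combinations(range(cols), size):
                sub = [[matrix[r][c] for c in cs] for r in rs]
                if det_mod_p(sub, p) == 0:
                    return False
    return True


# A Cauchy matrix (entries 1 / (x_i - y_j)) is a standard superregular example.
p = 11
xs, ys = [0, 1, 2], [3, 4]
cauchy = [[pow((x - y) % p, -1, p) for y in ys] for x in xs]
```

Here `is_superregular(cauchy, p)` holds because every square submatrix of a Cauchy matrix is itself a Cauchy matrix, while the all-ones 2 by 2 matrix fails the check since its determinant is zero.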
Definition 5 (-column constructible).
We will say that an matrix is -column constructible from an matrix if and only if there exists a subset of size , such that the columns of are in the span of . We say that a matrix is -column block-constructible from an matrix if and only if for every , the submatrix is -column constructible from .
Theorem 10.
A systematic convertible code with initial parity generator matrix and final parity generator matrix is MDS and access-optimal if the following two conditions hold: (1) if then is -column block-constructible from , and (2) are superregular.
Proof.
Follows from Proposition 9 and Definition 5. ∎
Thus, we can reduce the problem of proving the optimality of a systematic MDS convertible code in the merge regime to that of showing that matrices and satisfy the two properties mentioned in Theorem 10.
We first show that the construction specified in Section V-A satisfies condition (1) of Theorem 10.
Lemma 11.
Let be as defined in Section V-A. Then is -column block-constructible from .
Proof.
Consider the first columns of , which we denote as . Notice that can be written as the following block matrix:
[TABLE]
where is the diagonal matrix with as the diagonal elements. From this representation, it is clear that can be constructed from the first columns of . ∎
It only remains to show that the construction specified in Section V-A satisfies condition (2) of Theorem 10, that is, that and are superregular. To do this, we consider the minors of and as polynomials in . We show that, due to the structure of the matrices and as specified in Section V-A, none of these polynomials can have as a root as long as the field size is sufficiently large. Therefore none of the minors can be zero.
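The block structure used in this proof can be checked numerically. The entry formula was lost in this copy of the text; the sketch below assumes entries that are powers of a fixed element, namely entry (i, j) equal to theta raised to i times j with zero-based indices, which is one form consistent with the diagonal-matrix structure described above. It works over a small prime field and only verifies column block-constructibility; it does not check superregularity, which the text says needs a sufficiently large field. All names are hypothetical.

```python
def parity_matrix(num_rows, num_cols, theta, p):
    """Assumed form: entry (i, j) is theta^(i*j) mod p, zero-based indices."""
    return [[pow(theta, i * j, p) for j in range(num_cols)]
            for i in range(num_rows)]


def is_block_constructible(k_initial, r_initial, r_final, num_stripes, theta, p):
    """Check that each stripe's block of the final parity matrix equals the first
    r_final columns of the initial parity matrix, with column j scaled by a
    stripe-dependent power of theta (the diagonal matrix in the proof)."""
    if r_final > r_initial:
        raise ValueError("this check assumes r_final <= r_initial")
    p_init = parity_matrix(k_initial, r_initial, theta, p)
    p_final = parity_matrix(num_stripes * k_initial, r_final, theta, p)
    for s in range(num_stripes):
        for j in range(r_final):
            scale = pow(theta, s * k_initial * j, p)  # diagonal entry for column j
            for i in range(k_initial):
                if p_final[s * k_initial + i][j] != p_init[i][j] * scale % p:
                    return False
    return True
```

Under this assumed entry formula the check succeeds for any `theta`, because theta^((s*k + i)*j) factors as theta^(i*j) times theta^(s*k*j). Each new column therefore depends on a single column of the initial parity matrix per stripe.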
Lemma 12.
Let be as defined in Section V-A. Then and are superregular, for sufficiently large field size.
Proof.
Let be a submatrix of or , determined by the row indices and the column indices , and denote entry of as . The determinant of is defined by the Leibniz formula:
[TABLE]
is the set of all permutations on elements, and is the sign of the permutation (the sign of a permutation depends on the number of inversions in ). Clearly, defines a univariate polynomial . We will now show that by showing that there is a unique permutation for which achieves this value, and that this is the maximum over all permutations in . This means that has a leading term of degree .
To prove this, we show that any permutation can be modified into a permutation such that . Specifically, we show that , the identity permutation. Consider : let be the smallest index such that , let , and let . Let be such that , , and for . In other words, is the result of “swapping” the images of and in . Notice that and . Then, we have that:
[TABLE]
The last inequality comes from the fact that implies and implies . Therefore, .
Let be the maximum degree of over all submatrices of or . Then, corresponds to the diagonal with the largest elements in or . In this is the diagonal of the square submatrix formed by the bottom rows. In it can be either the diagonal of the square submatrix formed by the bottom rows, or by the right columns. Thus, we have that:
[TABLE]
Let . Then, if for some submatrix , is a root of , which is a contradiction since is a primitive element and the minimal polynomial of over has degree [12]. ∎
This construction is practical only for small values of these parameters since the required field size grows rapidly with the lengths of the initial and final codes. In Section VI we present practical low-field-size constructions.
Combining the above results leads to the following key result on the achievability of the lower bounds on access cost derived in Section IV.
Theorem 13.
The explicit construction provided in Section V-A yields access-optimal linear MDS convertible codes for all parameter values in the merge regime.
Proof.
Follows from Theorem 10, Lemma 11, and Lemma 12. ∎
VI Low field-size constructions based on superregular Hankel arrays
In this section we present alternative constructions for convertible codes that require a significantly lower (polynomial) field size than the general construction presented in Section V.
Key idea. The key idea behind our constructions is to take the matrices and as submatrices from a specially constructed triangular array of the following form:
[TABLE]
such that every submatrix of is superregular. Here, (1) are (not necessarily distinct) elements from , and (2) is at most the field size . The array is said to have Hankel form, which means that , for all . We denote a superregular Hankel array. Such an array can be constructed by employing the algorithm proposed in [59] (where the algorithm was employed to construct generalized Cauchy matrices to yield generalized Reed-Solomon codes). We note that the algorithm outlined in [59] takes the field size as input, and generates as the output. It is easy to see that thus generated can be truncated to generate the triangular array for any .
We construct the initial and final codes by taking submatrices and from superregular Hankel arrays (the submatrices have to be contained in the triangle where the array is defined). This guarantees that and are superregular. In addition, we exploit the Hankel form of the array by carefully choosing the submatrices that form and to ensure that is -column block-constructible from . Given the way we construct these matrices and the properties of , all the initial and final codes presented in this section are generalized doubly-extended Reed-Solomon codes [59].
The above idea yields a sequence of constructions with a tradeoff between the field size and the range of supported. We first present the two constructions at the extreme ends of this tradeoff, which we call Hankel-I and Hankel-II. Construction Hankel-I, described in Section VI-A, can be applied whenever , and requires a field size of . Construction Hankel-II, described in Section VI-B, can be applied whenever , and requires a field size of . We then discuss the constructions that fall in between these two constructions in the tradeoff between field size and coverage of values in Section VI-C. In Section VI-C we also provide a discussion on the ability of these constructions to be optimal even when parameters of the final code are a priori unknown. Throughout this section we will assume that . The ideas presented here are still applicable when , but the constructions and analysis change in minor ways.
VI-A Hankel-I construction
The Hankel-I construction provides an access-optimal linear MDS convertible code when , and requires a field size of . Notice that this construction has no penalty in terms of field size for access-optimal conversion, since it has the same field size requirement as the maximum between a pair of and Reed-Solomon codes [12]. We start by illustrating the construction with an example.
Example 3.
Consider the parameters . First, we construct a superregular Hankel array of size , , employing the algorithm in [59]. Then choose and from as shown in Figure 3. Checking that these matrices are superregular follows from the superregularity of . Furthermore, notice that the chosen parity matrices have the following structure:
[TABLE]
From this structure, it is clear that is -column block-constructible from . The field size required for this construction is .
General construction. Now we describe how to construct for all valid parameters , where . As seen in Example 3, this construction works by splitting the encoding vectors corresponding to the initial parities into groups, which are then combined to obtain the (at most) new encoding vectors.
Let be as defined in EquationĀ 13, with . Choose to be the submatrix of the top-left elements of . Denote the submatrix of the top-left elements of as Q:
[TABLE]
We choose to be any submatrix of that includes columns . The Hankel form of array implies that for all . As a consequence, we have that the -th column of is equal to the vertical concatenation of columns of .
Since both and are submatrices of , they are superregular. Furthermore, since every column of is the concatenation of columns of , it is clear that is -column block-constructible from . Thus and satisfy both of the sufficient conditions laid out in Theorem 10, and hence the Hankel-I construction is access-optimal during conversion.
(Access-optimal) Conversion process. During conversion, the data blocks from each of the initial stripes remain unchanged, and become the data blocks from the final stripe as detailed below. The new (parity) blocks from the final stripe are constructed by accessing blocks from the initial stripes. To construct the -th new block (corresponding to the -th column of , ), read parity block from each initial stripe , and then sum the blocks read. The encoding vector of the new block will be equal to the sum of the encoding vectors of the blocks read (recall from Section IV-A that the initial encoding vectors are embedded into a dimensional space). This is done for every new encoding vector .
VI-B Hankel-II construction
The Hankel-II construction, in contrast to the Hankel-I construction above, can handle a broader range of parameter values, at the cost of a slightly larger field-size requirement. In particular, we present a construction of access-optimal MDS convertible codes for all , requiring a field size of . We start with an example illustrating this construction.
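The read-and-sum conversion described above can be sketched with a toy Hankel array. Several things here are assumptions, since the indices were lost in this copy of the text: array entry (i, j) is taken to be `b[i + j]`, the final parity matrix is taken to be the top-left submatrix, and stripe `s` is assumed to hold parity columns at array offsets `s * k_initial + j`. The sequence `b` below is arbitrary and is not a superregular Hankel array, so the sketch illustrates only the conversion identity over a prime field, not the MDS property. All function names are hypothetical.

```python
P = 257  # small prime field, chosen only for illustration


def column_offsets(k_initial, num_stripes, r_final):
    """Array column offsets assumed to be present in the initial parity matrix."""
    return [s * k_initial + j for s in range(num_stripes) for j in range(r_final)]


def encode_parities(data, b, offsets, p=P):
    """Initial parity blocks of one stripe, keyed by array column offset."""
    return {off: sum(d * b[i + off] for i, d in enumerate(data)) % p
            for off in offsets}


def convert(stripe_parities, k_initial, r_final, p=P):
    """New parity j: read one parity block from each initial stripe and sum them."""
    return [sum(par[s * k_initial + j] for s, par in enumerate(stripe_parities)) % p
            for j in range(r_final)]


def encode_final_directly(stripes, b, r_final, p=P):
    """Reference: re-encode the merged data with the top-left array columns."""
    merged = [d for stripe in stripes for d in stripe]
    return [sum(d * b[i + j] for i, d in enumerate(merged)) % p
            for j in range(r_final)]


k_initial, num_stripes, r_final = 3, 2, 2
b = [(t * t + 3 * t + 2) % P for t in range(num_stripes * k_initial + r_final - 1)]
stripes = [[5, 7, 11], [13, 17, 19]]
offsets = column_offsets(k_initial, num_stripes, r_final)
parities = [encode_parities(data, b, offsets) for data in stripes]
```

In this sketch `convert(parities, k_initial, r_final)` matches `encode_final_directly(stripes, b, r_final)` while reading only `r_final` parity blocks per stripe and no data blocks. The identity holds because row `s * k_initial + i` of final column `j` and row `i` of the initial column at offset `s * k_initial + j` both index `b` at the same position.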
Example 4.
Consider parameters . First, we construct a superregular Hankel array of size , , by choosing as the field size, and employing the algorithm in [59]. Then choose and from as shown in Figure 4. Both matrices are superregular by the superregularity of . Notice that the chosen parity matrices have the following structure:
[TABLE]
It is easy to see that is -column block-constructible from .
General construction. Now we describe how to construct and for all valid parameters such that . As seen in Example 4, this construction works by choosing the initial parity encoding vectors so that any consecutive initial parity encoding vectors can be combined into a new encoding vector.
Let be as in Equation 13, with . We take and as the following submatrices of :
[TABLE]
The Hankel form of array guarantees that the -th column of corresponds to the concatenation of columns of . Thus, is -column block-constructible from . Furthermore, since and are submatrices of , they are superregular.
(Access-optimal) Conversion process. During conversion, the data blocks from each of the initial stripes remain unchanged, and become the data blocks from the final stripe. The new (parity) blocks from the final stripe are constructed by accessing blocks from the initial stripes as detailed below. To construct the -th new block (corresponding to the -th column of , ), read parity block from each initial stripe , and then sum the blocks read. The encoding vector of the new block will be equal to the sum of the encoding vectors of the blocks read (recall from Section IV-A that the initial encoding vectors are embedded into a dimensional space). This is done for every new encoding vector .
VI-C Sequence of Hankel-based constructions and handling a priori unknown parameters
Sequence of Hankel-based constructions. Our idea of Hankel-array-based construction yields a sequence of access-optimal MDS convertible codes with a tradeoff between field size and the range of supported. The two constructions presented in Section VI-A and Section VI-B are the two extreme points of this tradeoff.
In particular, our construction can support, for all :
[TABLE]
The parameter corresponds to the number of groups into which the encoding vectors corresponding to the initial parities are split. That is, each group of consecutive initial parity encoding vectors has size or . The Hankel-I construction corresponds to and Hankel-II corresponds to .
Handling a priori unknown parameters. So far, we have assumed that the parameters of the final code, , are known a priori and are fixed. As discussed in Section III, this is useful in developing an understanding of the fundamental limits of code conversion. When realizing code conversion in practice, however, the parameters might not be known at code construction time (as they depend on the empirically observed failure rates). Thus, it is of interest to be able to convert a code optimally to multiple different parameters. The Hankel-array-based constructions presented above indeed provide such flexibility. Our constructions continue to enable access-optimal conversion for any and with and .
VII Conclusions and Future directions
In this paper, we propose the “code conversion” problem, which models the problem of converting data encoded with an code into data encoded with an code in a resource-efficient manner. The proposed problem is motivated by the practical necessity of reducing the overhead of redundancy adaptation in erasure-coded storage systems. This is a new opportunity beckoning coding theorists to enable large-scale real-world storage systems to adapt their redundancy levels to varying failure rates of storage devices, thereby achieving significant savings in resources and energy consumption. We present the framework of convertible codes for studying code conversions, and fully characterize the fundamental limits for the access cost of conversions for an important regime of convertible codes. Furthermore, we present practical low-field-size constructions for access-optimal convertible codes for a wide range of parameters.
This work leads to a number of challenging and potentially impactful open problems. An important future direction is to go beyond the merge regime considered in this paper and study the fundamental limits on the access cost and construct optimal convertible codes for general parameter regimes. Another important future direction is to analyze the fundamental limits on the overhead of other cluster resources during code conversions, such as network bandwidth, disk IO, and CPU consumption, and construct convertible codes optimizing these resources. Note that while the access-optimal convertible codes considered in this paper also reduce the total network bandwidth, disk IO, and CPU overhead during conversion as compared to the default approach, the overhead on these other resources may not be optimal.
Acknowledgements
We thank Michael Rudow for his valuable feedback and helpful comments during the writing of this paper.
References
- [1] D. Ford, F. Labelle, F. Popovici, M. Stokely, V. Truong, L. Barroso, C. Grimes, and S. Quinlan, “Availability in globally distributed storage systems,” in USENIX Symposium on Operating Systems Design and Implementation, 2010.
- [2] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, “A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster,” in Proceedings of USENIX HotStorage, Jun. 2013.
- [3] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, “A Hitchhiker’s guide to fast and efficient data reconstruction in erasure-coded data centers,” in ACM SIGCOMM, 2014.
- [4] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, “XORing elephants: Novel erasure codes for big data,” in VLDB Endowment, 2013.
- [5] S. Ghemawat, H. Gobioff, and S. Leung, “The Google file system,” in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
- [6] D. Borthakur, R. Schmidt, R. Vadali, S. Chen, and P. Kling, “HDFS RAID - Facebook.” [Online]. Available: http://www.slideshare.net/ydn/hdfs-raid-facebook
- [7] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, “Erasure coding in Windows Azure storage,” in Proceedings of USENIX Annual Technical Conference (ATC), 2012.
- [8] Apache Software Foundation, “Apache Hadoop: HDFS erasure coding,” accessed: 2019-07-23. [Online]. Available: https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html
