An Indel-Resistant Error-Correcting Code for DNA-Based Information Storage

William H. Press; John A. Hawkins

arXiv:1812.01112·q-bio.QM·December 5, 2018

TL;DR

This paper introduces HEDGES, an error-correcting code designed for DNA data storage that effectively corrects substitutions, insertions, and deletions within a single read, improving efficiency and approaching Shannon limits.

Contribution

The paper presents HEDGES, a novel ECC capable of correcting all major DNA sequencing errors in one read, advancing DNA storage reliability and efficiency.

Findings

01. Corrects up to ~10% nucleotide errors (substitutions, insertions, and deletions combined).

02. Achieves 50% or more of the estimated Shannon limit.

03. Operates across a range of code rates (0.166 to 0.750 in Table 1).

Abstract

Synthetic DNA can in principle be used for the archival storage of arbitrary data. Because errors are introduced during DNA synthesis, storage, and sequencing, an error-correcting code (ECC) is necessary for error-free recovery of the data. Previous work has utilized ECCs that can correct substitution errors, but not insertion or deletion errors (indels), instead relying on sequencing depth and multiple alignment to detect and correct indels -- in effect an inefficient multiple-repetition code. This paper describes an ECC, termed "HEDGES", that corrects simultaneously for substitutions, insertions, and deletions in a single read. Varying code rates allow for correction of up to ~10% nucleotide errors and achieve 50% or better of the estimated Shannon limit.

Tables

Table 1: Mapping of bits $b_{i}$ to variable-bits $v_{i}$ for various code rates

Code Rate	Pattern	$P_{\text{ok}}$ (see Section 4.6)
0.750	$2, 1, 2, 1, \dots$	$-0.035$
0.600	$2, 1, 1, 1, 1, 2, 1, 1, 1, 1, \dots$	$-0.082$
0.500	$1, 1, \dots$	$-0.127$
0.333	$1, 1, 0, 1, 1, 0, \dots$	$-0.229$
0.250	$1, 0, 1, 0, \dots$	$-0.265$
0.166	$1, 0, 0, 1, 0, 0, \dots$	$-0.324$

Table 2: Probability that the outer code $RS(255,223)$ fails to correct 100% of message errors as a function of $P_{\text{equiv}}$ , the inner-code HEDGES output byte error rate (plus half the erasure rate). One sees that $P_{\text{equiv}}<0.01$ is sufficient for error-free decoding of gigabyte-length messages.

$P_{\text{equiv}}$	$P_{\text{failure}}$
0.005	$5.25 \times 10^{-14}$
0.010	$2.08 \times 10^{-9}$
0.020	$2.53 \times 10^{-5}$
0.030	$2.39 \times 10^{-3}$
0.040	$3.16 \times 10^{-2}$

Equations

$$b_{i}, \quad i = 0, 1, 2, \dots, M, \quad b_{i} \in \{0, 1\}$$

$$C_{i}, \quad i = 0, 1, 2, \dots, N, \quad C_{i} \in \{A, C, G, T\} \equiv \{0, 1, 2, 3\}$$

$$S_{i} = \text{known} \in \mathbb{Z}_{2}^{\otimes s}$$

$$I_{i} \equiv i \pmod{2^{q}}$$

$$B_{i} \equiv [b_{i-r}\, b_{i-r+1} \dots b_{i-1}] \in \mathbb{Z}_{2}^{\otimes r}$$

$$F(S, I, B) : \mathbb{Z}_{2}^{\otimes (r+q+s)} \to \mathbb{Z}_{4}$$

$$C_{i} = K_{i} + b_{i} = F(S_{i}, I_{i}, B_{i}) + b_{i} \pmod{4}$$

$$\begin{array}{ll}
H\text{:=}\,[i+1,\{0,1\},B_{i+1},k]: & \Delta P = P_{\text{del}} \\
H\text{:=}\,[i+1,\{0,1\},B_{i+1},k+1]: & \Delta P = (P_{\text{ok}} \text{ if } C=C^{\prime} \text{ else } P_{\text{sub}}) \\
H\text{:=}\,[i+1,\{0,1\},B_{i+1},k+2]: & \Delta P = (P_{\text{ins}}+P_{\text{ok}} \text{ if } C=C^{\prime} \text{ else } P_{\text{ins}}+P_{\text{sub}})
\end{array}$$

$$\begin{aligned}
S_{0} &= 0 \\
S_{i} &= S_{i-1}\, b_{i}, \quad i = 1, \dots, n-1 \quad \text{(denoting concatenation)} \\
S_{i} &= S_{i-1}, \quad i \geq n
\end{aligned}$$

$$\begin{aligned}
\text{Rate 0.750:} \quad & v_{0} = b_{0} b_{1},\; v_{1} = b_{2},\; v_{2} = b_{3} b_{4},\; v_{3} = b_{5},\; \dots \\
\text{Rate 0.250:} \quad & v_{0} = b_{0},\; v_{1} = 0,\; v_{2} = b_{1},\; v_{3} = 0,\; \dots
\end{aligned}$$

$$C_{i} = K_{i} + v_{i} = F(S_{i}, I_{i}, V_{i}) + v_{i} \pmod{4}$$

$$N_{\text{equiv}} = (\text{byte errors}) + 0.5 \times (\text{byte erasures}) \leq 16$$

$$P_{\text{failure}} = \text{PoissonCDF}(k > 16 \mid \lambda = 255\, P_{\text{equiv}})$$

$$C_{\text{Shannon}} = 2 - \left[H_{2}(p) + p \log_{2} 3 + \tfrac{1}{3} p \log_{2} 4 + \tfrac{1}{3} p \log_{2} 3\right]$$

$$H_{2}(p) \equiv -p \log_{2} p - (1-p) \log_{2} (1-p)$$

$$C_{\text{HEDGES}} = 2 \times (\text{code rate}) \times [1 - H_{2}(p_{\text{equiv}})]$$

Taxonomy

Topics: DNA and Biological Computing · Advanced biosensing and bioanalysis techniques · Algorithms and Data Compression

Full text

An Indel-Resistant Error-Correcting Code for DNA-Based Information Storage

W.H. Press and J.A. Hawkins

Institute for Computational Engineering and Sciences

University of Texas at Austin

1 Introduction

Engineered DNA is an information channel. One can convert an arbitrary message into a string of DNA characters, or bases, $\{A,C,G,T\}$; synthesize the string into a physical DNA sample; store or transport the sample through space and time; sequence it back to a string of characters; and then hope to recover exactly the original message. Because errors are introduced during all the stages of synthesis, storage, and sequencing, it is necessary to utilize an error-correcting code (ECC) both when converting message bits to DNA characters (encoding) and later, when DNA characters are converted back to message bits (decoding). The ECC needs to correct three kinds of errors: substitutions of one base by another, spurious insertions of bases, and deletions of bases from the message. Insertions and deletions are commonly termed “indels”.

The correction of substitutions is a standard problem in coding theory, where substitutions are termed “errors”. The overarching theoretical framework for coding theory starts with Shannon [1], and there exist hundreds, if not thousands, of well-studied ECCs [2, 3, 4, 5]. However, established methods for error correction in the case of silent deletions (termed deletion channels) are few, and there are virtually no established methods for channels with all three of deletions, insertions, and substitutions. (See [6] and [7] for reviews and references.) Indeed, no approaches suggested in the literature are well suited to DNA applications [8]. For example, almost all attention has been on binary channels, while the DNA channel is quaternary.

As the main contribution of this paper, we describe in Section 4 a method for encoding a stream of arbitrary message bits onto a stream of DNA characters with an ECC that simultaneously corrects all three kinds of errors. Our ECC, which we call HEDGES (for “Hash Encoded, Decoded by Greedy Exhaustive Search”), is tuned to recover character-by-character message synchronization, even at the cost of leaving a small number of uncorrected substitution errors. This tuning makes HEDGES useful as the “inner” code (closest to the channel and applied last in encoding) in a concatenated code design, leaving it to a conventional “outer” code (applied first in encoding, last in decoding) to correct any remaining substitution errors. Below, in combination with HEDGES, we will use the standard Reed-Solomon (R-S) code [9] denoted RS(255,223). R-S codes are completely intolerant of indels, i.e., they require perfectly synchronized input.

Because of our tuning of HEDGES for use in a concatenated design, it will be useful to first describe a possible full system design (Section 3), and only then describe HEDGES in detail (Section 4). The system design in Section 3 is illustrative but not unique. HEDGES itself, as a quaternary-alphabet indel-resistant ECC, is general and can readily be utilized in other overall designs.

2 Related Work

There is a growing body of experimental work on DNA information storage, employing various strategies for dealing with errors. We summarize chronologically some of the previous work as it relates to this paper.

Church et al. [10] synthesized oligomers of length 159, each of which contained both address information (ordering of oligomers in the message) and payload. There was no explicit ECC. The pool of oligomers was sequenced to a depth of 3000x, allowing recovery of a consensus sequence with high probability. High-depth coverage can correct sequencing errors, but not synthesis errors. Indeed, the final results contained 10 bit errors.

Goldman et al. [11] synthesized oligos with 3/4 of each strand overlapping the previous strand, in effect a 4x repetition code. Each strand had parity check bits for error detection (but not correction). A ternary code (ternary message alphabet to quaternary DNA alphabet) was utilized to avoid homopolymers. Error correction was done by sequencing to high depth and filtering for, and aligning, perfectly sequenced fragments. The final results contained two gaps where none of the four overlapping sequences were recovered.

Grass et al. [12] implemented interleaved Reed-Solomon codes for DNA storage. Message bits were converted to characters in the 47-character alphabet GF(47), with interleaved correction in blocks of $(713\times 39)$ characters. An inner code mapped GF(47) characters to 47 DNA trimers chosen to guarantee no homopolymers of length $>3$ in the final DNA message, but with no other redundancy. Indels were corrected by sequencing at sufficient depth to reject faulty strands.

Bornholt et al. [13] introduced strand-to-strand redundancy by creating strand $C=A\oplus B$ (or other redundant combinations), where $\oplus$ denotes exclusive-or, and utilizing majority decoding. Parity check bits allowed the filtering out of faulty strands. There was no explicit correction of indels.

Erlich and Zielinski [14] utilized quite a different overall architecture, based on fountain codes. Fountain codes send linear combinations of portions of a message in “droplets”, such that one can recover the original message by the solution of linear equations. The included redundancy allows the loss of some droplets. R-S coding was used within each oligomer droplet for error detection but not correction. Faulty droplets, including any with indels, were rejected.

Yazdi et al. [15] leveraged the multiple sequence alignment capabilities of several sophisticated packages in conjunction with a custom homopolymer check code to correct the large error rates (especially homopolymer errors) associated with MinION nanopore sequencing. Sequencing depths were in the range of several hundred.

In the largest-scale experiment to date, Organick et al. [16] encoded and recovered, error-free, more than 200 MB of data. R-S coding was used across strands, with no explicit error correction within a strand. Substitutions and indels within a strand were corrected by multiple alignment and consensus calling. Coverage was 5x for high-quality Illumina sequencing, rising to the 36x to 80x required for Nanopore technology.

To summarize, while previous work has adopted increasingly sophisticated system designs, there has been little progress in the fundamental problem of correcting indels within a single strand. The use of sequencing to large depth, followed by multiple alignment, is in effect a use of the oldest, simplest, and arguably least efficient ECC, namely a simple repetition code, sending the same message multiple times and taking the consensus. This manifests itself as sequencing stored DNA to high coverage, finding sets of reads that appear to derive from the same intended sequence, and then merging the reads into a consensus sequence and/or filtering out any reads that fail some quality check. Though this is a central part of previously implemented DNA error-correction schemes, it also tends to be left unaccounted for in claims of code efficiency. The central result in this paper is a technique for correcting substitutions and indels in a single strand, i.e., when sequenced to no more than depth one.

3 Example System Design

For cost and efficiency, both DNA synthesis and DNA sequencing employ massive parallelism. That is, many short sequences, each of length hundreds to thousands of bases, are written (synthesized) or read (sequenced) simultaneously. While the length of a single synthesis or read will increase as technology improves, it is unlikely that the great advantage of parallelism will ever be superseded. This being the case, the basic units of our system design are individual strands of length $10^{2}$ – $10^{4}$ .

To connect with our use of RS(255,223), we define a “DNA packet” as an ordered set of 255 DNA strands. When any one strand in the set is decoded with HEDGES, it produces a message fragment of length $L$ bytes (say), now having high probability of being perfectly synchronized. Each 255 correctly ordered message fragments form a “message packet”, as illustrated in Figure 1. There can be any number of message packets in a total communication.

The Reed-Solomon code is applied across the strands (interleaved). This enables it to protect against missing strands (“erasures” to coding theorists) as well as to correct any residual substitution errors that were not corrected by HEDGES. Unlike previous investigations, we apply the R-S code diagonally across the strands (see Figure). This increases the resistance to any failure of synthesis or sequencing to produce full-length strands. It also ameliorates the effect of the observed tendency for error rates to be higher at the ends of strands.

It is an important point that the Reed-Solomon code can only be applied after the strands in a packet are identified as being from one particular packet (out of an assumed pool of many packets, perhaps millions) and are correctly ordered. This implies that a packet’s identification number and a strand’s serial number within the packet (both shown as shaded green in the Figure) cannot themselves be R-S protected. We will instead protect them by a different technique (“salt protection”) that is described in Section 4.4. Salt protection has the effect of turning uncorrectable errors in the identification/serial bytes into erasures in the message bytes—which are correctable by R-S.

Summarizing, here are the main points that affect the design of HEDGES as an inner code: (1) We don’t need to decode strands of arbitrary length, but only of some known uncorrupted length $L$ . (2) Recovering synchronization has the highest priority. (3) Known erasures are less harmful than unknown substitutions, because R-S can correct twice as many erasures as substitution errors. (4) Burst errors within a single byte are less harmful than distributed bit errors, because R-S corrects a byte at a time. (5) Within the R-S code’s capacity for byte errors and erasures, residual errors will be fully corrected by the outer code, yielding an error-free message.
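Point (3) and the failure probabilities in Table 2 can be checked numerically. The sketch below assumes the relation listed in the equations above, $N_{\text{equiv}} = (\text{byte errors}) + 0.5\times(\text{byte erasures}) \leq 16$, which is the usual bound for a code with $255-223=32$ parity bytes, and the Poisson model $P_{\text{failure}} = \text{PoissonCDF}(k>16 \mid \lambda = 255\,P_{\text{equiv}})$. The function names are illustrative only and are not taken from the authors' software.

```python
import math


def rs_decodable(byte_errors, byte_erasures, parity=32):
    """RS(255,223) budget: 2 * errors + erasures must not exceed 32 parity bytes."""
    return 2 * byte_errors + byte_erasures <= parity


def poisson_tail(k_min, lam, terms=400):
    """P(K >= k_min) for K ~ Poisson(lam), summed term by term to avoid cancellation."""
    return sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(k_min, k_min + terms)
    )


def p_failure(p_equiv, n=255, t=16):
    """Probability that more than t equivalent byte errors occur in an n-byte codeword."""
    return poisson_tail(t + 1, n * p_equiv)


if __name__ == "__main__":
    for p in (0.005, 0.010, 0.020, 0.030, 0.040):
        print(f"P_equiv = {p:.3f}  ->  P_failure = {p_failure(p):.3g}")
```

Under these assumptions, a codeword with 10 byte errors and 12 erasures is still decodable, while one with 17 byte errors is not, which is why converting an unknown error into a known erasure is a net gain for the outer code.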

4 HEDGES, an Indel-Correcting Code

4.1 Overall Strategy

Given a message stream of bits

$$b_{i}, \quad i = 0, 1, 2, \dots, M, \quad b_{i} \in \{0, 1\} \tag{1}$$

(“the message” or “bits”), we want to emit a stream of DNA characters

$$C_{i}, \quad i = 0, 1, 2, \dots, N, \quad C_{i} \in \{A, C, G, T\} \equiv \{0, 1, 2, 3\}$$

(“the codestream” or “characters”). We first describe the case of a half-rate code, where we emit exactly one $C_{i}$ (2 bits of output) for each $b_{i}$ (1 bit of input). In Section 4.5 we generalize to codes at other rates $r$ (message bits per codestream bit), $0<r<1$ , so that the streams $b_{i}$ and $C_{i}$ are not then in lockstep, and $M\neq N$ . One should think of $N$ as being on the order of $10^{2}$ to $10^{4}$ , the maximum length of a single DNA strand that can be cheaply synthesized today or in the foreseeable future.

We want to be able to decode without residual errors a received codestream $C^{\prime}$ that differs from $C$ by substitutions (errors), insertions, and deletions (collectively “indels”). Indels are silent: their positions in the codestream $C^{\prime}$ are not known to the receiver.

The basic plan is a variant of a centuries-old cryptographic technique, “text auto-key encoding” [17]. We generate a keystream of characters $K_{i}\in\{0,1,2,3\}$ , where each $K_{i}$ depends pseudorandomly (but deterministically by a hash function) on some number of previous message bits $b_{j}$ (with $j<i$ ), and also directly on the bit position index $i$ . (We can initialize the previous bits by defining $b_{j}\equiv 0$ when $j<0$ .) We then emit a codestream character $C_{i}=K_{i}+b_{i}$ , the addition performed modulo 4. In the terminology of modern code theory, this scheme would be called a type of “tree code” or, more specifically, an “infinite constraint-length convolutional code”.

The redundancy necessary for error correction comes from the fact that $b_{i}$ takes on only two values, while $K_{i}$ and $C_{i}$ can have four values. This generates (only) one bit of redundancy per character, i.e., a received character can be acausally valid by chance half the time. However, the dependence of $K_{i}$ on many previous message bits ties any given message bit to many future bits of redundancy. Similarly, the dependence of $K_{i}$ on $i$ ties every bit to its position index, so that (as we will see) insertions can be identified and removed, and deleted values can be restored.

What is not obvious is that a codestream thus generated can actually be practically decoded, especially in the presence of errors and indels at significant rates. We will show by numerical simulation that it can be, remarkably easily, essentially by guessing successive message bits and scoring against the likelihood of the codestream under the guessed hypothesis. Wrong guesses will be rejected by implying exponentially small downstream likelihoods. In coding theory, this general technique is known as “sequential decoding”.

4.2 Encoding Algorithm

Elaborating slightly on the above description, let $S_{i}$ denote an arbitrary $s$ -bit value (“salt”) that can depend on $i$ but is known to both sender and receiver,

$$S_{i} = \text{known} \in \mathbb{Z}_{2}^{\otimes s}$$

Denote the low-order $q$ bits of the bit position index $i$ by

$$I_{i} \equiv i \pmod{2^{q}}$$

Let $B_{i}$ denote the $r$ previous concatenated bits

$$B_{i} \equiv [b_{i-r}\, b_{i-r+1} \dots b_{i-1}] \in \mathbb{Z}_{2}^{\otimes r}$$

Finally, let $F(S,I,B)$ be a deterministic hash function from $r+q+s$ bits to 2 bits

$$F(S, I, B) : \mathbb{Z}_{2}^{\otimes (r+q+s)} \to \mathbb{Z}_{4}$$

Then the formula for encoding is

$$C_{i} = K_{i} + b_{i} = F(S_{i}, I_{i}, B_{i}) + b_{i} \pmod{4} \tag{2}$$

Figure 2 shows the algorithm graphically.

Typical values that we use are $r=8$ , $q=10$ , $s=46$ , so that $r+q+s=64$ bits, a convenient value for input to the hash. For the hash function we use the low order 2 bits from the Numerical Recipes [18] function Ranhash.int64(), because it is very fast and will occur in the inner loop of the decode algorithm.
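As a minimal sketch of the half-rate encoding rule $C_{i} = F(S_{i}, I_{i}, B_{i}) + b_{i} \pmod 4$, the Python below packs $s=46$ salt bits, $q=10$ index bits, and $r=8$ previous message bits into one 64-bit word and keeps the low 2 bits of its hash. Two things here are assumptions rather than the authors' choices: the hash `mix64` is the SplitMix64 finalizer, used as a stand-in for Ranhash.int64(), and the salt is held constant along the strand. All function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
R_BITS, Q_BITS, S_BITS = 8, 10, 46   # r, q, s from the text; r + q + s = 64


def mix64(x):
    """Stand-in 64-bit hash (SplitMix64 finalizer); NOT Ranhash.int64()."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def keystream_char(salt, i, prev_bits):
    """K_i = F(S_i, I_i, B_i): hash the packed 64-bit word, keep the low 2 bits."""
    s = salt & ((1 << S_BITS) - 1)
    idx = i & ((1 << Q_BITS) - 1)            # I_i = i mod 2^q
    b = prev_bits & ((1 << R_BITS) - 1)      # B_i = the r previous message bits
    word = (s << (Q_BITS + R_BITS)) | (idx << R_BITS) | b
    return mix64(word) & 3


def hedges_encode(bits, salt=0):
    """Half-rate encode: one character C_i = (K_i + b_i) mod 4 per message bit."""
    prev = 0                                  # b_j = 0 for j < 0
    out = []
    for i, bit in enumerate(bits):
        out.append((keystream_char(salt, i, prev) + bit) % 4)
        prev = ((prev << 1) | bit) & ((1 << R_BITS) - 1)
    return out


def to_dna(chars):
    """Map characters 0..3 to the bases A, C, G, T."""
    return "".join("ACGT"[c] for c in chars)


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
    print(to_dna(hedges_encode(message)))
```

Because the keystream depends on earlier bits, changing one message bit generally changes the hash input, and hence the keystream, for the next $r$ positions.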

4.3 Decoding Algorithm

For simplicity, assume that error rates are “small”, so that “most” DNA bases are received as they were intended. (We will see in Section 5 that DNA character error rates up to $\sim 5$%–$10$% are tolerable.) Suppose we have correctly decoded and synchronized the message through bit $b_{i-1}$ and now want to know bit $b_{i}$ . Guessing the two possibilities, $\{0,1\}$ , we use equation (2) to predict two possibilities for the character $C_{i}$ . In the absence of an error, exactly one of these agrees with the observed character $C_{i}^{\prime}$ . We assign to a guess that generates disagreement with $C_{i}^{\prime}$ a penalty score equal (conceptually) to the negative log probability of observing a substitution error. In other words, a wrong guess might actually be right, but only if a substitution has occurred. If neither guess agrees with the observed $C_{i}^{\prime}$ , then both are assigned the substitution penalty.

We have not yet accounted for the possibility of insertions and deletions, however. In fact, there are more than the above two possible guesses. We must guess not just $b_{i}\in\{0,1\}$ , but also a “skew” $\Delta\in\{\ldots,-1,0,1,\ldots\}$ that tells us whether in comparing $C$ to $C^{\prime}$ we should skip characters ( $\Delta>0$ ) because of insertions, or posit missing characters ( $\Delta<0$ ) because of deletions (in which case there is no comparison to be done). As a practical simplification we consider only $\Delta\in\{-1,0,1\}$ . (We comment on this simplification in Section 4.6.) Then there are six guesses for $(b_{i},\Delta)\in\{0,1\}\otimes\{-1,0,1\}$ . Each can be scored by an appropriate log probability penalty for any implied substitution, insertion, or deletion.

Log probability penalties accumulate additively along any chain of guesses. In the causal case of a chain of all-correct guesses, we accumulate penalties only in the (relatively rare) case of actual errors. However, because of the way that the key $K_{i}$ (equation (2)) is constructed, a single wrong guess for either $b_{i}$ , $i$ , or $\Delta$ throws us into the acausal case where 3/4 of subsequent comparisons of computed $C$ (at some bit position index $i$ ) to observed $C^{\prime}$ (at some index $k$ ) will not agree, so penalties will accumulate rapidly. The decoding problem, conceptually a maximum likelihood search, thus reduces to a shortest-path search in a tree with branching factor 6, but with the saving grace that the correct path will be much shorter than any deviation from it.

We can formalize the above discussion as follows. Let $H\text{:=}\,[i,b_{i},B_{i},k]$ denote the joint hypothesis that the values $i,b_{i},B_{i}$ are all correct and synchronize to the observed codestream character $C_{k}^{\prime}$ through equation (2). As a node in the search tree, the hypothesis $H\text{:=}\,[i,b_{i},B_{i},k]$ spawns six child hypotheses, each of which can be scored with additional penalty $\Delta P$ (to be added to their common parent’s accumulated penalty) as follows:

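$$\begin{array}{ll}
H\text{:=}\,[i+1,\{0,1\},B_{i+1},k]: & \Delta P = P_{\text{del}} \\
H\text{:=}\,[i+1,\{0,1\},B_{i+1},k+1]: & \Delta P = (P_{\text{ok}} \text{ if } C=C^{\prime} \text{ else } P_{\text{sub}}) \\
H\text{:=}\,[i+1,\{0,1\},B_{i+1},k+2]: & \Delta P = (P_{\text{ins}}+P_{\text{ok}} \text{ if } C=C^{\prime} \text{ else } P_{\text{ins}}+P_{\text{sub}})
\end{array}$$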

Here $P_{\text{sub}},P_{\text{ins}},P_{\text{del}}$ can be thought of as respectively the log probability penalties for substitution, insertion, or deletion errors (but see Section 4.6). $P_{\text{ok}}$ is the penalty or, if negative, reward, for an agreement between the computed and received codestream characters $C$ and $C^{\prime}$ . In the comparison notated above as $C=C^{\prime}$ , the index of $C$ is the first parameter in the hypothesis $H$ , while the index of $C^{\prime}$ is the last parameter in $H$ . Note that a child node’s $B_{i+1}$ is always computable from its parent’s $B_{i}$ and $b_{i}$ .

How can we practically search this huge tree? A conceptual starting point is the famous A* search algorithm [19], a best-first (that is, “greedy”) search utilizing a heap data structure. A* assigns a heuristic cost to every node that is the sum of its actual cost plus a quantity less than or equal to the smallest possible additional cost that it can incur in reaching the goal. (For a tree of constant depth, this is equivalent to adding a reward for every step taken closer to the leaf nodes, i.e., a negative constant $P_{\text{ok}}$ above.) Figure 3 shows the logical flow of an A* search, and also the HEDGES decode algorithm. As already remarked, in coding theory, this kind of decoding strategy is called “sequential decoding”.

Provably, A* always finds the best path. For our application, unfortunately, it is exponentially slow, because actual errors along the true path cause too many spawned hypotheses to be revisited; and because its termination criterion is too restrictive, again leading to too many spawned hypotheses.

To ameliorate these problems we make two heuristic modifications of A*: First, we allow $P_{\text{ok}}$ to be more negative than that sanctioned by A* and tune its value heuristically. While we thus lose the guarantee of finding exactly the shortest path, we heuristically encourage the search not to revisit earlier hypotheses after a sufficiently lengthy run of successes along one particular chain. Second, we adopt a “first past the post” termination criterion. That is, the first chain of hypotheses to decode the required $L$ bytes of message wins. It is not obvious (or, by us, provable) that these heuristics should result in a workable or efficient algorithm, but we will demonstrate by numerical experiment that they do.
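The search described in this section can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not the authors' C++ implementation: the hash is the SplitMix64 finalizer standing in for Ranhash.int64(), the salt is constant, all three error penalties are set to 1 with $P_{\text{ok}}=-0.127$ (the half-rate value in Table 1), and a hypothetical `budget` parameter caps the heap size. Each popped hypothesis spawns up to six children: for each guessed bit, a deletion (no character consumed), a match or substitution (one character consumed), and an insertion (two characters consumed).

```python
import heapq
import itertools

MASK64 = (1 << 64) - 1
R_BITS, Q_BITS = 8, 10            # r and q from Section 4.2
P_SUB = P_INS = P_DEL = 1.0       # penalties normalized to 1 (an assumption here)
P_OK = -0.127                     # Table 1 value for code rate 0.5


def mix64(x):
    """Stand-in 64-bit hash (SplitMix64 finalizer), not Ranhash.int64()."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def key(i, prev, salt=0):
    """K_i: low 2 bits of the hash of the packed (salt, index, previous bits) word."""
    word = (salt << (Q_BITS + R_BITS)) | ((i % (1 << Q_BITS)) << R_BITS) | prev
    return mix64(word & MASK64) & 3


def encode(bits, salt=0):
    """Half-rate encoder, included so the example is self-contained."""
    prev, out = 0, []
    for i, b in enumerate(bits):
        out.append((key(i, prev, salt) + b) % 4)
        prev = ((prev << 1) | b) & ((1 << R_BITS) - 1)
    return out


def decode(received, n_bits, salt=0, budget=200_000):
    """Best-first search; returns decoded bits, or None if the heap budget is spent."""
    tie = itertools.count()       # unique tie-breaker so paths are never compared
    # Heap entry: (score, tie, i, k, prev_bits, path); path is a linked (bit, parent).
    heap = [(0.0, next(tie), 0, 0, 0, None)]
    while heap:
        if len(heap) > budget:
            return None           # treat the strand as an erasure
        score, _, i, k, prev, path = heapq.heappop(heap)
        if i == n_bits:           # "first past the post" termination
            bits = []
            while path is not None:
                bit, path = path
                bits.append(bit)
            return bits[::-1]
        k_i = key(i, prev, salt)
        for b in (0, 1):
            c = (k_i + b) % 4
            nxt = ((prev << 1) | b) & ((1 << R_BITS) - 1)
            node = (b, path)
            # Deletion: the character for bit i is missing, so k does not advance.
            heapq.heappush(heap, (score + P_DEL, next(tie), i + 1, k, nxt, node))
            if k < len(received):         # match or substitution at received[k]
                d = P_OK if c == received[k] else P_SUB
                heapq.heappush(heap, (score + d, next(tie), i + 1, k + 1, nxt, node))
            if k + 1 < len(received):     # insertion: skip received[k]
                d = P_INS + (P_OK if c == received[k + 1] else P_SUB)
                heapq.heappush(heap, (score + d, next(tie), i + 1, k + 2, nxt, node))
    return None


if __name__ == "__main__":
    message = [(7 * i + i // 3) % 2 for i in range(64)]
    sent = encode(message)
    print(decode(sent, len(message)) == message)          # clean channel
    print(decode(sent[:20] + sent[21:], len(message)))    # one deleted character
```

On an error-free codestream the correct chain earns $P_{\text{ok}}$ at every step while every alternative costs at least $1+P_{\text{ok}}$, so the first chain to finish is the transmitted message. With errors present the outcome depends on the hash and the penalties, and this sketch makes no claim about the failure rates reported in the paper.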

4.4 Use of Salt to Protect Critical Message

Above, we noted the importance of protecting message bits that determine the ordering or “serial number” of strands for the outer, concatenated Reed-Solomon code. In equation (2) (and Figure 2) we allowed for some number of bits of known salt $S_{i}$ when message bit $b_{i}$ is encoded. Here is how this salt enables extra protection: Suppose we want to protect an initial $n$ message bits. Then define recursively the salt by

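$$\begin{aligned}
S_{0} &= 0 \\
S_{i} &= S_{i-1}\, b_{i}, \quad i = 1, \dots, n-1 \quad \text{(denoting concatenation)} \\
S_{i} &= S_{i-1}, \quad i \geq n
\end{aligned}$$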

Most errors in the first $n$ bits will be corrected as usual by the shortest-path heap search. But any residual error that gets through will “poison” the salt for the entire rest of the strand, rendering it undecodable. In effect we convert an error in the protected bits into an erasure of the whole strand. This may seem drastic, but it is just what we want: A strand with an incorrect serial number (and hence incorrect ordering among other strands) would look like a strand of errors (with probability 255/256 per byte) to the outer R-S; an erased strand is equivalent to only half as many errors.
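The “poisoning” effect can be seen directly from the salt recursion, read here as $S_{0}=0$; $S_{i}=S_{i-1}b_{i}$ (concatenation) for $i=1,\dots,n-1$; and $S_{i}=S_{i-1}$ for $i\geq n$. The sketch below represents the salt as an integer and concatenation as a left shift. The function name is hypothetical, and the sketch assumes $n-1\leq s$ so that the accumulated bits fit in the salt field.

```python
def salt_schedule(bits, n_protected):
    """Return [S_0, S_1, ...], one salt value per message bit position."""
    salts = [0]                                        # S_0 = 0
    for i in range(1, len(bits)):
        if i < n_protected:
            salts.append((salts[-1] << 1) | bits[i])   # S_i = S_{i-1} b_i
        else:
            salts.append(salts[-1])                    # S_i = S_{i-1} for i >= n
    return salts


if __name__ == "__main__":
    good = [1, 0, 1, 1, 0, 1]
    bad = [1, 0, 0, 1, 0, 1]            # protected bit b_2 decoded incorrectly
    print(salt_schedule(good, 4))       # [0, 0, 1, 3, 3, 3]
    print(salt_schedule(bad, 4))        # [0, 0, 0, 1, 1, 1]
```

A single wrong protected bit changes the salt at every later position, so every later keystream character is computed from a different hash input and the remainder of the strand no longer decodes.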

4.5 Code Rates Other than One-Half

A simple modification of the encode and decode algorithms described in Sections 4.2 and 4.3 allows for code rates other than one-half. Take the input bitstream of expression (1) and partition it into a stream of values $v_{k}$ with variable numbers of bits in the range $0$ to $2$ , according to a repetitive pattern like the ones shown in Table 1.

Here are two examples showing how to interpret the entries in Table 1 (with adjacency denoting two-bit values in $\mathbb{Z}_{4}$ ):

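$$\begin{aligned}
\text{Rate 0.750:} \quad & v_{0} = b_{0} b_{1},\; v_{1} = b_{2},\; v_{2} = b_{3} b_{4},\; v_{3} = b_{5},\; \dots \\
\text{Rate 0.250:} \quad & v_{0} = b_{0},\; v_{1} = 0,\; v_{2} = b_{1},\; v_{3} = 0,\; \dots
\end{aligned}$$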

Equation (2) for encoding now becomes

$$C_{i} = K_{i} + v_{i} = F(S_{i}, I_{i}, V_{i}) + v_{i} \pmod{4}$$

where $V_{i}$ is composed of concatenated previous variable bits. Pattern values of 0 provide one bit of additional redundancy check relative to the base case of code rate one-half, while pattern values of 2, encoding 2 bits per DNA character, provide one less bit. By construction the code rate is one-half the average of the integers in the pattern. The column in the table labeled $P_{\text{ok}}$ will be explained in Section 4.6.

Decoding follows exactly the same pattern. Guessing a two-bit $v_{i}$ spawns 12 child hypotheses, while guessing a zero-bit $v_{i}$ spawns only 3.
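The partition into variable-bit values and the resulting code rate can be sketched as follows. The function names are hypothetical, and zero-padding a short final chunk is an assumption made only so the example handles arbitrary input lengths.

```python
from itertools import cycle


def partition_bits(bits, pattern):
    """Split bits into (width, value) pairs whose widths cycle through `pattern`."""
    if not any(pattern):
        raise ValueError("pattern must contain at least one nonzero width")
    values, pos = [], 0
    for width in cycle(pattern):
        if pos >= len(bits):
            break
        chunk = list(bits[pos:pos + width])
        chunk += [0] * (width - len(chunk))   # assumption: zero-pad a short last chunk
        v = 0
        for bit in chunk:
            v = (v << 1) | bit                # adjacency means a two-bit value in Z_4
        values.append((width, v))
        pos += width
    return values


def code_rate(pattern):
    """Message bits per codestream bit: half the average pattern entry."""
    return sum(pattern) / (2 * len(pattern))


def children(width):
    """Child hypotheses spawned per step: 3 skews times 2**width value guesses."""
    return 3 * 2 ** width


if __name__ == "__main__":
    print(partition_bits([1, 0, 1, 1, 0, 1], [2, 1]))  # [(2, 2), (1, 1), (2, 2), (1, 1)]
    print(code_rate([2, 1]), code_rate([1, 0]))        # 0.75 0.25
    print([children(w) for w in (0, 1, 2)])            # [3, 6, 12]
```

The pattern `[2, 1]` consumes three message bits per two DNA characters (four codestream bits), giving rate 0.75, and the child counts 3, 6, and 12 match the zero-bit, one-bit, and two-bit cases described above.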

4.6 Choice of, And Trade-Offs Among, Parameters

For encoding, the parameter choices are (i) the choice of code rate and variable bit pattern (as in Table 1), the default case being code rate $0.5$ ; (ii) the number $q>0$ of low-order bits of position index in the hash; (iii) the number $r>0$ of previous message bits in the hash; (iv) the number $s\geq 0$ of salt bits; and (v) the number $n\geq 0$ of initial message bits to be protected by salt.

It might at first seem that bigger is better for both $q$ and $r$ , but this is not the case. Restricting $r$ to a smaller value better allows the heap search to recover from previous errors, basically by finding an acausal (i.e., “wrong”) path that coincidentally puts it back on track. As for $q$ , restricting it to a smaller value could be useful in case one desires the capability of jumping into the middle of an undecoded message: The heap can then be initialized with all possible values of $I$ and $B$ (cf. Figure 2). For our system design, Section 3, this is not a necessary, or useful, capability, however. For the baseline validation experiments in Section 5, we take $q=10$ , $r=8$ , $n=16$ or $24$ .

For decoding, we need to know the encoding parameters, and must now also choose values for $P_{\text{sub}},P_{\text{del}},P_{\text{ins}}$ , and $P_{\text{ok}}$ . While, conceptually, these are negative log probabilities of the occurrence of the different kinds of errors (which can be known only after the fact), we adopt a more empirical approach. First, we take $P_{\text{sub}}=P_{\text{del}}=P_{\text{ins}}$ to give the HEDGES decoding algorithm equal robustness against all three kinds of errors. Second, we note that the search for shortest path is invariant under applying the same linear (or affine) transformation to all four $P$ ’s. So, without loss of generality, we may take $P_{\text{sub}}=P_{\text{del}}=P_{\text{ins}}=1$ , leaving $P_{\text{ok}}$ as the only free parameter. We determine optimal (or at least good) values for $P_{\text{ok}}$ by numerical experiment. We find that the optimal $P_{\text{ok}}$ depends only negligibly on the encoding parameters $q$ and $r$ , and only slightly on the length $L$ of the strand, but it does depend on the code rate. Good values for various code rates are given in the third column of Table 1.

Implicitly, the choice of $P_{\text{ok}}$ reflects a tradeoff between computational workload and decode failure probability. $P_{\text{ok}}$ that is too negative results in too greedy a search, which is fast but can get stuck in a blind alley that requires us to declare the rest of the strand as an erasure (hence its dependence on strand length). On the other hand, $P_{\text{ok}}$ that is insufficiently negative results in a too large, potentially exponential, expansion of the size of the heap. Happily, there is an accessible range of workable values. Changes of $\sim 10$ % in $P_{\text{ok}}$ matter little, and our values are implicitly tuned for best performance on strand lengths in the range $\sim 100$ to $\sim 1000$ .

In Section 4.3 above we limited the guesses for $\Delta$ to only $\{-1,0,1\}$ so as to limit the expansion of the heap. This results in more than one consecutive insertion or deletion being improperly scored. For example, without the possibility of skew $\Delta=-2$ , the shortest available path through two deletions $\ldots DD\ldots$ declares a spurious substitution $\ldots DSD\ldots$ . In practice, this makes little difference, because double deletions are significantly less common than single deletions, and because other, completely incorrect, paths score much worse.

It is an important point that choosing any set of decode parameters is not an irrevocable choice. Given a DNA message, one can make multiple tries, varying the decode parameters adaptively until acceptable performance is achieved. One can evaluate success by running time and by the count of errors needing correction by the outer R-S code. The parameter values that we suggest may be viewed as starting points.

5 Computer Validation Experiments

We have implemented HEDGES in C++ code, with a Python interface for convenience. (We similarly implemented a compatible Python interface to the published “Schifra” implementation of Reed-Solomon [20].) For tests on individual strands of length $L$ , we encode a random stream of message bits and degrade the resulting codestream by errors with a specified Poisson-random total rate, divided equally among the three error types, substitution, insertion, and deletion. Unless otherwise stated the HEDGES code rate is one-half.
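A test channel of this kind can be sketched as below. The exact error model of the authors' experiments is not specified beyond the description above, so this sketch makes a simplifying assumption: at each character, at most one error event occurs, with probability `total_rate / 3` for each of substitution, deletion, and insertion. The function name `degrade` is hypothetical.

```python
import random


def degrade(chars, total_rate, rng):
    """Corrupt a codestream of values 0..3 with equal-rate subs, dels, and ins."""
    p = total_rate / 3.0
    out = []
    for c in chars:
        u = rng.random()
        if u < p:
            # Substitution: shift by 1..3 so the result always differs from c.
            out.append((c + rng.randrange(1, 4)) % 4)
        elif u < 2 * p:
            # Deletion: drop the character silently.
            continue
        elif u < 3 * p:
            # Insertion: emit a spurious random character before the true one.
            out.append(rng.randrange(4))
            out.append(c)
        else:
            out.append(c)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [i % 4 for i in range(30)]
    noisy = degrade(clean, 0.10, rng)
    print(len(clean), len(noisy))
```

Passing an explicit `random.Random` instance keeps runs reproducible, which matters when comparing decode parameters on the same degraded strand.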

We allow each decode a “hypothesis budget”, that is, a maximum size to which the heap is allowed to expand. If, along a strand, a decode exceeds its budget, we declare subsequent message bits in that strand to be erasures. Figure 4 shows examples of how decodes expend their budget along a long strand. There is a sharp bifurcation between decodable strands, which typically expend $\lesssim 100$ hypotheses per decoded bit, and undecodable strands, which go into blind alleys and readily expend $\gtrsim 1000$ hypotheses per decoded bit. The figure shows 10 selected examples of each type. In practice undecodable strands are much rarer than decodable ones.
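The budget mechanism can be illustrated with a toy best-first search over bit strings. This is a sketch of the general idea only, not the HEDGES decoder: the scoring callback `step_cost`, the reward and penalty values, and the rule for reporting the longest prefix reached are all assumptions made for the example.

```python
import heapq


def best_first_decode(n_bits, step_cost, budget):
    """Toy best-first search with a cap on heap size (the hypothesis budget).

    `step_cost(prefix, bit)` is a hypothetical callback returning the score
    increment for extending `prefix` (a tuple of bits) by `bit`; lower total
    score is better. Returns (bits, exceeded). If the budget is exceeded,
    `bits` is the longest prefix expanded so far and the remaining bits
    would be declared erasures.
    """
    heap = [(0.0, ())]  # entries are (accumulated score, decoded prefix)
    best = ()
    while heap:
        score, prefix = heapq.heappop(heap)
        if len(prefix) > len(best):
            best = prefix
        if len(prefix) == n_bits:
            return prefix, False  # full-length hypothesis reached
        for bit in (0, 1):
            child = (score + step_cost(prefix, bit), prefix + (bit,))
            heapq.heappush(heap, child)
        if len(heap) > budget:
            return best, True  # budget exhausted
    return best, True


# Demonstration with a noiseless target: a matching bit earns a small reward
# (negative cost), a mismatching bit a larger penalty.
target = (1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1)


def cost(prefix, bit):
    return -0.1 if bit == target[len(prefix)] else 1.0


print(best_first_decode(len(target), cost, budget=100))
print(best_first_decode(len(target), cost, budget=5))
```

With the generous budget the search runs straight down the correct path; with the tight budget it stops early and returns only a prefix, which is the toy analogue of declaring the rest of the strand an erasure.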

Figure 5 shows decode failure rates actually achieved by a half-rate code, as a function of length of strand, input total error rate, and hypothesis budget. One sees that, for strand lengths in the useful range 100–1000, failure rates $\lesssim 10^{-2}$ are readily achievable for total input error rates up to $\sim 5\%$. For an input error rate of 3%, strand lengths up to $10^{4}$ are feasible. Failure rates $\lesssim 10^{-2}$ are easily absorbed as erasures in the error budget of an outer, interleaved Reed-Solomon code (see below, this section).

In the case of a successful decode, there may remain uncorrected substitution errors. Figure 6 shows the uncorrected (output) bit and byte error rates along strands of length 240 for the three input codestream character error rates 3%, 5%, and 10%. The uncorrected error rates vary along the strand for two reasons. First, for this experiment, we applied salt protection to the first 24 message bits (that is, 24 characters for the half-rate code). One sees that this worked as advertised: there were no uncorrected errors in the first 24 bits. Second, uncorrected error rates are seen to rise as the end of the strand is approached. Although undesirable, this is an inevitable feature of HEDGES. As the strand end is approached, there are fewer redundancy checks available downstream, making the greedy search algorithm less selective. Our system design (Section 3) allows for specifying some number of “runout” bits at the ends of the strands, encoding zeros and not part of the message packet. How much runout to allow depends on how much one wants to burden the R-S error budget. For this numerical experiment, we assumed 24 runout bits.

It is notable that the byte error rates in Figure 6 are only $\approx 3$ times the bit error rates, rather than $\lesssim 8$ times (depending on the bit error rate) if the errors were randomly distributed. This shows that HEDGES’s uncorrected errors are bursty, which is good for input to R-S and gains some overall efficiency.

Exclusive of the salt-protected and runout regions, the uncorrected bit error rates for this experiment are about $1\times 10^{-3}$, $3.5\times 10^{-3}$, and $2\times 10^{-2}$ for input error rates 3%, 5%, and 10%, respectively. The corresponding byte error rates are $3\times 10^{-3}$, $1\times 10^{-2}$, and $6\times 10^{-2}$.
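The burstiness argument can be checked with a short calculation. The sketch below takes the three bit error rates quoted above and computes the byte error rate that would result if bit errors were independent, that is, the probability that at least one of the 8 bits in a byte is wrong. The function name is ours.

```python
def byte_error_rate_if_independent(bit_error_rate):
    """Probability that a byte has at least one bad bit, for independent bit errors."""
    return 1.0 - (1.0 - bit_error_rate) ** 8


ratios = []
for ber in (1e-3, 3.5e-3, 2e-2):
    ratio = byte_error_rate_if_independent(ber) / ber
    ratios.append(ratio)
    print(f"bit error rate {ber:g}: byte/bit ratio {ratio:.2f} if independent")
```

For all three rates the independent-error ratio comes out a little under 8, whereas the observed byte error rates are only about 3 times the bit error rates, so the residual errors must cluster within bytes.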

What level of uncorrected errors can we allow through to the R-S outer code while keeping a high probability that they will be corrected there? $RS(255,223)$ can correct 16 byte errors, or any combination of byte errors and erasures whose equivalent number is

$$N_{\text{equiv}} \equiv N_{\text{errors}} + \tfrac{1}{2}\,N_{\text{erasures}} \leq 16. \tag{5}$$

The probability of failure to completely correct 255 bytes, for Poisson random byte errors/erasures is thus the cumulative probability

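$$P_{\text{fail}} = \sum_{k=17}^{\infty} \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad \lambda = 255\,P_{\text{equiv}}, \tag{6}$$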

where $P_{\text{equiv}}$ is the byte error rate (B.e.r.) plus half the erasure rate. As mentioned above, we interleave and apply the R-S code diagonally (see Figure 1) because (i) strands may be missing, (ii) byte errors along a single strand may be bursty, and (iii) it is found experimentally that sequencing and synthesis error rates can be different (often larger) near the ends of strands. The interleaved diagonal pattern ensures that no single length-255 R-S packet is handicapped by an error rate much different from the average across the whole strand, and that its number of errors will be distributed with (close to) Poisson statistics. Table 2 evaluates equation (6) for relevant values of $P_{\text{equiv}}$. One sees that values $P_{\text{equiv}}\lesssim 1\%$ are adequate to guarantee error-free decoding of messages of gigabyte length or longer.
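The packet failure probability can be evaluated numerically as a Poisson tail. The sketch below is our reading of the text, not the authors' code: we assume the number of equivalent errors in a 255-byte packet is Poisson with mean $255\,P_{\text{equiv}}$ and that the packet fails when that number exceeds 16. The tail is summed directly, rather than as one minus the cumulative sum, to keep precision when the result is very small.

```python
import math


def rs_failure_probability(p_equiv, n=255, t=16, n_terms=200):
    """Poisson probability that more than t equivalent errors fall in n bytes.

    Assumption: the error count is Poisson with mean n * p_equiv, and
    RS(255,223) fails when the count exceeds t = 16.
    """
    lam = n * p_equiv
    if lam <= 0.0:
        return 0.0
    k = t + 1
    # Poisson pmf at k = t + 1, computed in log space to avoid overflow
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for _ in range(n_terms):
        total += term
        k += 1
        term *= lam / k  # recurrence: pmf(k) = pmf(k - 1) * lam / k
    return min(total, 1.0)


for p in (0.005, 0.01, 0.02, 0.03):
    print(f"P_equiv = {p:.3f}: P_fail ~ {rs_failure_probability(p):.3e}")
```

Multiplying the per-packet failure probability by the number of 223-byte payloads in a message gives a rough estimate of the chance that any packet in the message fails.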

Figure 7 now shows the results of a numerical experiment evaluating $P_{\text{equiv}}$ as a function of input DNA error rates for six different code rates. The evaluation was done with strand length $L=300$ (as a variant of the value $L=240$ in Figure 6), no salt protection, and 2 bytes of runout on each strand (errors in which are not counted). Using Table 2, one sees that a DNA error rate of about 1% is correctable at code rate 3/4 with probability effectively 1, increasing to a correctable error rate of about 15% at code rate 1/6.

6 Channel Capacity re Shannon Limit

We might wonder how close the results of Figure 7 come to the absolute bound of the Shannon limiting channel capacity [1]. Unfortunately, computing the Shannon limit for even the simplest case of a binary deletion channel, let alone channels that also have insertions and substitutions, remains a difficult unsolved problem [6, 22]. Still, it is possible to get some idea by making an informed estimate as follows.

A remarkable theorem of Shannon [21] proves that the channel capacity of a forward error-corrected channel is identical to that of a “feedback channel,” where the sender gets to see (error-free) what was actually received, and can then send correcting information post hoc. We can thus estimate channel capacity by reducing the maximum capacity (for DNA, 2 bits per character) by the entropy of the necessary correction messages. The reason that this is only an estimate (strictly, a lower bound) is that we may not be sending the optimally short correction messages, especially as the error rate becomes large.

In our case, suppose $p$ is the character error rate. As before, assume equal probabilities $p/3$ for the three kinds of errors. Then we can estimate the Shannon limit of the channel as

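$$C \approx 2 - \left[ H_{2}(p) + p\log_{2}3 + \frac{p}{3}\cdot 2 + \frac{p}{3}\log_{2}3 \right], \tag{7}$$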

where $H_{2}$ is the entropy of a binary choice,

$$H_{2}(p) \equiv -p\log_{2}p - (1-p)\log_{2}(1-p). \tag{8}$$

In equation (7), the first term in brackets is the cost of communicating whether a particular code character marks the position of an error. The second term is the cost of telling which kind of error (substitution, deletion, insertion). The third term is the cost of communicating the missing character in a deletion. The fourth is the similar cost for a substitution. While strictly only a lower bound, analogous results for binary deletion channels [22] suggest that equation (7) is actually a good approximation for small values of $p$ .
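The four costs just listed can be added up in a few lines. The sketch below is one reading of that description and should be treated as an assumption: $H_{2}(p)$ bits per character to flag error positions, $\log_{2}3$ bits per error to name its kind, 2 bits per deletion to supply the missing character, and $\log_{2}3$ bits per substitution to supply the correct one of the three alternatives. The function names are ours.

```python
import math


def h2(p):
    """Entropy in bits of a binary choice with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def shannon_estimate(p):
    """Estimated capacity in bits per DNA character at total error rate p.

    Assumes equal rates p/3 for substitutions, insertions, and deletions.
    """
    locate = h2(p)                          # is this position an error?
    kind = p * math.log2(3)                 # which of the three error kinds
    deleted_char = (p / 3) * 2.0            # which of 4 bases was deleted
    substituted_char = (p / 3) * math.log2(3)  # which of 3 other bases is correct
    return 2.0 - (locate + kind + deleted_char + substituted_char)


for p in (0.01, 0.03, 0.05, 0.10):
    print(f"p = {p:.2f}: estimated capacity ~ {shannon_estimate(p):.3f} bits/character")
```

At $p=0$ the estimate returns the full 2 bits per character, and it decreases as the error rate grows, dominated at small $p$ by the cost $H_{2}(p)$ of locating the errors.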

We now calculate the channel capacity actually achievable with HEDGES by the relation

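$$C_{\text{HEDGES}} = 2 \times (\text{code rate}) \times \left[ 1 - H_{2}(\text{bit error rate}) \right]. \tag{9}$$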

Here the factor 2 is the number of bits per DNA character, while the factor in square brackets reflects the loss of channel capacity to an (assumed perfect) concatenated outer code that corrects all of HEDGES’ uncorrected bit errors.
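As a rough numerical illustration, the sketch below evaluates this relation under the assumption that it has the form 2 × (code rate) × [1 − $H_{2}$(residual bit error rate)], which is our reading of the sentence above. It uses the half-rate residual bit error rates quoted in Section 5 for strand length 240; those were measured under somewhat different conditions from Figure 8, so the numbers are indicative only.

```python
import math


def h2(p):
    """Entropy in bits of a binary choice with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def hedges_capacity(code_rate, residual_ber):
    """Net bits per DNA character after a perfect outer code removes residual errors."""
    return 2.0 * code_rate * (1.0 - h2(residual_ber))


# (input character error rate, residual bit error rate) at code rate 1/2
for p_in, ber in ((0.03, 1e-3), (0.05, 3.5e-3), (0.10, 2e-2)):
    print(f"input error {p_in:.2f}: ~{hedges_capacity(0.5, ber):.3f} bits/character")
```

A half-rate code can never exceed 1 bit per character, so the bracketed factor measures how much of that ceiling survives once the outer code has paid for the residual errors.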

Figure 8 shows the results of the comparison. One sees that HEDGES achieves a respectable fraction, $\gtrsim 0.5$ , of the estimated Shannon limit for DNA character error rates up to 20%.

7 Discussion

Previous work on DNA information storage, despite the increasing sophistication of methods, has largely ignored the possibility of directly correcting insertion and deletion errors by an appropriate error-correcting code. Instead, most previous work has relied on multiple sequence alignment after sequencing DNA messages to significant depths. In effect, though not always acknowledged, this method is an inefficient multiple-repetition code.

This paper developed a coding technique, termed HEDGES, for the direct correction of insertions and deletions, along with substitutions, workable with (combined) DNA character error rates up to 20%, and at a respectable fraction of the Shannon information limit. The code, HEDGES, was optimized for use as the inner code in an overall design with an outer concatenated code that will generally be interleaved across DNA strands. Used with HEDGES, the outer code need not be indel-aware and can be a conventional ECC like Reed-Solomon.

Acknowledgments

We have benefitted from communication with Dave Forney, Dan Costello, and Vince Poor, and from our continuing collaboration in related matters with Ilya Finkelstein and Stephen Jones.

Bibliography


[1] Shannon CE, Weaver W, The Mathematical Theory of Communication (University of Illinois Press, 1949)
[2] MacWilliams FJ, Sloane NJA, The Theory of Error-Correcting Codes (North Holland, 1983)
[3] Roth RM, Introduction to Coding Theory (Cambridge University Press, 2006)
[4] Moon TK, Error Correction Coding: Mathematical Methods and Algorithms (Wiley, 2005)
[5] Lin S, Costello DJ, Error Control Coding, Second Edition (Pearson, 2004)
[6] Mitzenmacher M, “A survey of results for deletion channels and related synchronization channels”, Probability Surveys, vol. 6, pp. 1-33 (2009)
[7] Li R, New developments in coding against insertions and deletions, Honors Thesis, Carnegie Mellon University, at https://www-cs.stanford.edu/ rayyli/static/paper/Ray Li_Honors Thesis_20170828.pdf
[8] Hawkins JA, Jones SK, Finkelstein IJ, Press WH, “Indel-correcting DNA barcodes for high-throughput sequencing”, PNAS, 115 (27), pp. E6217-E6226 (2018)