On the Design of Codes for DNA Computing: Secondary Structure Avoidance Codes
Tuan Thanh Nguyen, Kui Cai, Han Mao Kiah, Duc Tu Dao, and Kees A. Schouhamer Immink

TL;DR
This paper presents explicit constructions of DNA codes that completely avoid secondary structures of any stem length, improving code rates and providing efficient encoding methods for DNA computing applications.
Contribution
The work introduces novel explicit constructions for DNA codes that eliminate secondary structures of arbitrary stem length, surpassing previous code rate limits.
Findings
Constructed DNA codes with rate 1.3031 bits/nt for m=3.
Achieved efficient encoding with only one redundant symbol for large m.
Provided methods to avoid secondary structures of any stem length ≥ m.
On the Design of Codes for DNA Computing: Secondary Structure Avoidance Codes
Tuan Thanh Nguyen1, Kui Cai1, Han Mao Kiah2, Duc Tu Dao2, and Kees A. Schouhamer Immink3
1 Science, Mathematics and Technology Cluster, Singapore University of Technology and Design, Singapore 487372
2 School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
3 Turing Machines Inc, Willemskade 15d, 3016 DK Rotterdam, The Netherlands
Emails: {tuanthanh_nguyen, cai_kui}@sutd.edu.sg, {hmkiah,daoductu001}@ntu.edu.sg, [email protected]
Abstract
In this work, we investigate a challenging problem that has been considered an important criterion in designing codewords for DNA computing purposes, namely secondary structure avoidance in single-stranded DNA molecules. In short, secondary structure refers to the tendency of a single-stranded DNA sequence to fold back upon itself, thus becoming inactive in the computation process. While some design criteria that reduce the possibility of secondary structure formation have been proposed by Milenkovic and Kashyap (2006), the main contribution of this work is to provide an explicit construction of DNA codes that completely avoid secondary structure of arbitrary stem length.
Formally, given codeword length $n$ and arbitrary integer $m \ge 2$, we provide efficient methods to construct DNA codes of length $n$ that avoid secondary structure of any stem length more than or equal to $m$. Particularly, when $m = 3$, our constructions yield a family of DNA codes of rate 1.3031 bits/nt, while the highest rate found in the prior art was 1.1609 bits/nt. In addition, for sufficiently large $m$, we provide an efficient encoder that incurs only one redundant symbol.
I Introduction
DNA computing is an emerging branch of computing that uses DNA, biochemistry, and molecular biology hardware. The field of DNA computation started with a demonstration by Adleman in 1994 [1]. In this seminal experiment, Adleman solved an instance of the directed Hamiltonian path problem, a close relative of the traveling salesperson problem, by first representing each vertex (city) with a synthetic DNA molecule. Then, by allowing the strands to hybridize in a highly parallel fashion, Adleman obtained the desired solution. Since then, similar methods have been extended to several attractive applications, including the development of storage technologies [2, 3, 4, 5] and cell-based computation systems for cancer diagnostics and treatment [6]. Recently, the hybridization process was exploited to allow random access in DNA data storage [7].
In DNA computing, only short single-stranded DNA sequences (or oligonucleotide sequences) are used, where each of them is an oriented word consisting of four bases (or nucleotides): Adenine (A), Thymine (T), Cytosine (C), and Guanine (G). A set of encoded DNA sequences (also called DNA codewords) that satisfies certain special properties (or constraints) for DNA computing purposes is called a DNA code. A broad description of the kinds of constraint problems that arise in coding for DNA computing was given by Milenkovic and Kashyap in 2006 [8], including the constant GC-content constraint (which fixes the percentage of nucleotides that are either G or C), the Hamming distance constraint (which requires DNA codewords to be sufficiently different from one another), and the secondary structure avoidance constraint (which prevents a DNA sequence from folding back upon itself and consequently becoming inactive in the computation process). Similar considerations were described in [9, 10] for the design of primer address sequences in random access of DNA-based data storage systems. While the constant GC-content constraint and the Hamming distance constraint have been extensively investigated [11, 12, 13, 8, 14, 15, 16, 17], secondary structure avoidance has been studied far less.
For a DNA sequence, a secondary structure is formed when a chemically active sequence folds back onto itself by complementary base pair hybridization (illustrated via Figure 1). Here, the Watson-Crick complement is defined as $\overline{A} = T$, $\overline{T} = A$, $\overline{C} = G$, and $\overline{G} = C$. For a sequence $x = x_1 x_2 \cdots x_n$ over the DNA alphabet $\mathcal{D} = \{A, T, C, G\}$, the reverse-complement of $x$ is defined as ${\rm RC}(x) = \overline{x_n}\,\overline{x_{n-1}} \cdots \overline{x_1}$. In Figure 1, sub-sequences $x = ATACC$ and $y = {\rm RC}(x) = GGTAT$ of the DNA sequence bind to each other after pairing of A with T and G with C, forming a secondary structure with a loop and a stem of length 5. DNA sequences with secondary structures are less active in the computation process [8], and hence, before reading such sequences in a wet lab, they need to be unfolded, costing more resources and energy. There exist some simple dynamic programming techniques [18, 19] that can approximately predict the secondary structures in a given DNA sequence (for example, see the Nussinov-Jacobson (NJ) algorithm in [19], one of the most widely used schemes). Based on the NJ algorithm, the authors in [8, 13] found some design criteria that reduce the possibility of secondary structure formation in a codeword. A natural question is whether there exists an efficient design of DNA constrained codes that avoid the formation of secondary structures.
It has been shown experimentally that the number of base pairs in stem regions (or stem length) is one important factor influencing the secondary structure of a DNA sequence. Given codeword length $n$ and an integer $m \ge 2$, we study the problem of constructing DNA codes of length $n$ that avoid secondary structure of any stem length more than or equal to $m$. To the best of our knowledge, this work is the first attempt aimed at providing a rigorous solution for DNA codes avoiding secondary structure for general stem lengths.
II Preliminary
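In this work, we use $\mathcal{D}$ to denote the DNA alphabet, where $\mathcal{D} = \{A, T, C, G\}$. Here, we have the Watson-Crick complement where $\overline{A} = T$, $\overline{T} = A$, $\overline{C} = G$, and $\overline{G} = C$.
Given two sequences $x$ and $y$, we let $xy$ denote the concatenation of the two sequences.
Throughout this work, given a sequence $x = x_1 x_2 \cdots x_n$ of length $n$, we say $y$ is a subsequence of length $k$ of $x$, where $k \le n$, if $y = x_i x_{i+1} \cdots x_{i+k-1}$ for some $1 \le i \le n-k+1$. In other words, we only consider the subsequences consisting of consecutive symbols in $x$. Two subsequences $y$ and $z$ of $x$ are said to be non-overlapping if we have $y = x_{i_1} \cdots x_{j_1}$, $z = x_{i_2} \cdots x_{j_2}$, where $j_1 < i_2$ or $j_2 < i_1$.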
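Definition 1.
For a DNA sequence $x = x_1 x_2 \cdots x_n$, $x_i \in \{A, T, C, G\}$, the reverse-complement of $x$ is defined as ${\rm RC}(x) = \overline{x_n}\,\overline{x_{n-1}} \cdots \overline{x_1}$.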
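As a minimal sketch of Definition 1, the reverse-complement can be computed by complementing every base and reversing the order. The name `reverse_complement` is illustrative and does not come from the paper. The final line reproduces the stem pair ATACC / GGTAT mentioned in the Introduction.

```python
# Watson-Crick complement of each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(x: str) -> str:
    """Return RC(x): complement every base, then reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(x))


print(reverse_complement("ATACC"))  # GGTAT
```

Applying the map twice returns the original sequence, which is why the relation "one subsequence is the reverse-complement of the other" is symmetric.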
Definition 2.
Given $n, m > 0$, a DNA sequence $x$ of length $n$ is said to be an $m$-secondary structure avoidance (or $m$-SSA) sequence if for all $k \ge m$, there does not exist any pair of non-overlapping subsequences $y, z$ of length $k$ of $x$ such that $y = {\rm RC}(z)$. A code of length $n$ is said to be an SSA code if every codeword $x$ of the code is $m$-SSA.
The following result is immediate.
Lemma 1.
Given $m > 0$, if a sequence $x$ is $m$-SSA then $x$ is $m'$-SSA for all $m' \ge m$.
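The definition of an $m$-SSA sequence can be checked directly by brute force. The sketch below is only an illustration for short sequences (the names `is_ssa` and `reverse_complement` are ours, and the cubic-time search is not meant as an efficient test). Because the reverse-complement relation is symmetric, it is enough to look for a left subsequence whose reverse-complement appears entirely to its right.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(x: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(x))


def is_ssa(x: str, m: int) -> bool:
    """Return True if x has no non-overlapping pair y, z of equal length k >= m with y = RC(z)."""
    n = len(x)
    for k in range(m, n // 2 + 1):              # two disjoint windows need 2k <= n
        for i in range(n - 2 * k + 1):          # start of the left window
            target = reverse_complement(x[i:i + k])
            for j in range(i + k, n - k + 1):   # right window starts after the left one ends
                if x[j:j + k] == target:
                    return False
    return True


hairpin = "ATACC" + "TTT" + "GGTAT"   # stem of length 5 around a loop TTT
print(is_ssa(hairpin, 5))  # False
print(is_ssa(hairpin, 6))  # True
```

The loop over `k` starts at `m`, so a sequence that passes for `m` automatically passes for every larger stem length, in line with the lemma above.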
For a code $\mathcal{C}$ of length $n$, the code rate is measured by the value $\frac{1}{n}\log |\mathcal{C}|$. Intuitively, it measures the number of information bits stored in each DNA symbol. Suppose that we have an infinite family of codes $\{\mathcal{C}_n\}$, where $\mathcal{C}_n$ is a code of length $n$; then the asymptotic rate of the family is $\lim_{n \to \infty} \frac{1}{n}\log |\mathcal{C}_n|$. Here, we adopt the notation $\log$ to mean logarithm base two.
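To make the unit bits/nt concrete, the following sketch evaluates the rate of a code from its size and length. The function name `code_rate` is illustrative. The two examples are sanity checks that need no information beyond the alphabet size: the unconstrained quaternary space gives 2 bits/nt, and any code restricted to three of the four bases is capped at $\log 3 \approx 1.585$ bits/nt.

```python
import math


def code_rate(code_size: int, n: int) -> float:
    """Rate in bits per nucleotide: log2(|C|) / n."""
    return math.log2(code_size) / n


print(code_rate(4 ** 20, 20))  # 2.0, all quaternary sequences of length 20
print(code_rate(3 ** 20, 20))  # about 1.585, all sequences over a 3-letter alphabet
```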
Definition 3.
Given $m > 0$, for $n > 0$, let $A(n;m)$ be the total number of DNA sequences of length $n$ that are $m$-SSA. The channel capacity, denoted by ${\rm cap}_m$, is defined by:
$${\rm cap}_m = \lim_{n \to \infty} \frac{\log A(n;m)}{n}.$$
The following result is immediate.
Lemma 2.
Given , let be the set of all DNA sequences of length such that, there is no pair of sequences , not necessary distinct, such that . We then have .
Observe that the size of can be computed easily for constant , a trivial upper bound is that , and consequently, we obtain and .
To construct an SSA code for arbitrary by concatenation method, one can find the largest set for some suitable value of , such that, for , each codeword is a concatenation of sequences of length from and each concatenation does not create a reverse-complement subsequence from previous concatenations. The construction yields a family of DNA codes of rate bits/nt. For example, for , Krishna Gopal Benerjee and Adrish Banerjee [11] constructed an SSA code via such a set .
Theorem 1 (Benerjee and Banerjee [11]).
Set . Let be the DNA code of length where each codeword is a concatenation of words of length two from . We then have is an SSA code, i.e. every codeword of is -SSA. The size of the code is , and the code rate is bits/nt.
II-A Paper Organisation and Our Main Contribution
Since the number of base pairs in stem regions (or stem length) is one important factor influencing the secondary structure of a DNA sequence, this work aims at providing a rigorous solution for SSA codes given arbitrary $m$. The paper is organised as follows.
- Section III presents two efficient constructions of SSA codes for arbitrary $m$. The first construction is based on block concatenation, which concatenates blocks of fixed length from a predetermined set. The second construction relies on the concept of symbol-composition constrained codes. Particularly, when $m = 3$, the second construction yields a family of DNA codes of rate 1.3031 bits/nt, which is higher than the code rate in [11].
- Section IV presents a linear-time encoding method for SSA codes with only one redundant symbol whenever $m$ is sufficiently large. The coding method is based on the sequence replacement technique.
III Constructions of SSA Codes for Arbitrary $m$
The first method is based on block concatenation, which concatenates blocks of fixed length from a predetermined set.
III-A Constructions via Block Concatenation
Construction 1.
Given , for some integer , set . Let be the set of all DNA sequences of length such that for any pair of sequences , not necessary distinct, there is no pair of subsequences of and of of length such that . Let be the DNA code of length , where each codeword is a concatenation of sequences of length in .
Theorem 2.
The constructed code from Construction 1 is an SSA code.
Proof.
We prove the correctness of Theorem 2 by contradiction. Suppose that, there exists a codeword , where , and is not -SSA. In other words, there exists two non-overlapping subsequences , of of length such that .
Suppose that where is a subsequence of , and is a subsequence of for some . We have . The trivial case is if , or is of length more than , then is a subsequence of and is a subsequence of . Clearly, if , we have a contradiction. On the other hand, if where is a subsequence of and is a subsequence of for some , then at least one subsequence or is of size more than , we also have a contradiction. We conclude that , or is simply a subsequence of .
Now, since is of length , at least or . W.l.o.g, assume that .
We observe that cannot be a subsequence of any by Construction 1. In other words, where is a subsequence of and is a subsequence of for some . Similarly, we observe that the length of must be strictly smaller than , otherwise, for example, if the length of is more than or equal to , then two sequences and in contain and as subsequences, we have a contradiction. Since both the length of must be strictly smaller than , causing the length of is smaller than , we conclude that the length of is at least .
Now, let , the subsequence that belongs to both and , which is of size at least . We then have is a subsequence of while is a subsequence of , a subsequence of . We then have a contradiction.
In conclusion, we have is an SSA code. We highlight our proof sketch of Theorem 2 in Figure 5. ∎
Remark 1.
Observe that, the set can be constructed via exhaustive search with complexity . In Section IV, we show that when is sufficiently large, , there exists an efficient encoding/decoding algorithm for SSA codes with at most one redundant symbol. Hence, for the case , we can use Construction 1 to construct SSA codes with complexity .
III-B Constructions via Symbol-Composition Constrained Codes
In this subsection, we present an efficient construction for SSA codes by simply restricting the symbol-composition for every subsequence of length $m$. Particularly, when $m = 3$, our method yields a family of DNA codes of rate 1.3031 bits/nt, which is higher than the code rate in [11].
High Level Description. We select a nucleotide $\sigma \in \{A, T, C, G\}$, and let $\overline{\sigma}$ be its Watson-Crick complement. For some integers $t' < t$, we present an efficient method to construct an SSA code as follows. For every codeword $x$, every subsequence of length $m$ of $x$ contains at least $t$ symbols $\sigma$ and at most $t'$ symbols $\overline{\sigma}$. We refer to such a constraint as the symbol-composition constraint. It is easy to verify that such a constructed code is an SSA code. Suppose, on the contrary, that there exists a pair of non-overlapping subsequences $y, z$ of length $k \ge m$ in $x$ such that $y = {\rm RC}(z)$. This implies that there exist two subsequences of length $m$, namely $y'$ of $y$ and $z'$ of $z$, with $y' = {\rm RC}(z')$. Since $y'$ contains at least $t$ symbols $\sigma$, $z'$ must contain at least $t$ symbols $\overline{\sigma}$, which exceeds $t'$. We then have a contradiction.
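The argument above can be checked exhaustively for small parameters. The sketch below is an illustration under our own naming: `sigma`, `t` and `t_bar` stand for the chosen nucleotide, the minimum count of that nucleotide per window, and the maximum count of its complement per window, with `t_bar < t`. It enumerates all sequences of length 7, keeps those satisfying the symbol-composition constraint for window length 3, and confirms that none of them contains a stem of length 3. Checking stems of length exactly `m` suffices, since any longer stem contains one of length `m`.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(x: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(x))


def satisfies_composition(x: str, m: int, sigma: str, t: int, t_bar: int) -> bool:
    """Every length-m window has >= t copies of sigma and <= t_bar copies of its complement."""
    sigma_bar = COMPLEMENT[sigma]
    for i in range(len(x) - m + 1):
        window = x[i:i + m]
        if window.count(sigma) < t or window.count(sigma_bar) > t_bar:
            return False
    return True


def has_stem(x: str, m: int) -> bool:
    """True if two non-overlapping length-m windows are reverse-complements of each other."""
    n = len(x)
    for i in range(n - 2 * m + 1):
        target = reverse_complement(x[i:i + m])
        if any(x[j:j + m] == target for j in range(i + m, n - m + 1)):
            return True
    return False


# Exhaustive check for n = 7, m = 3, sigma = A, t = 2, t_bar = 1 (illustrative values).
n, m = 7, 3
code = [w for w in map("".join, product("ATCG", repeat=n))
        if satisfies_composition(w, m, "A", 2, 1)]
counterexamples = [w for w in code if has_stem(w, m)]
print(len(code), len(counterexamples))
```

The second printed number is zero, as the contradiction argument predicts; the first is simply the size of this small code.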
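The following construction is for $m = 3$ and $t = 1$.
Construction 2 (Symbol-Composition Constrained Codes for $m = 3$, $t = 1$).
Given $n > 0$, we select $\sigma = A$ and $\overline{\sigma} = T$. Set $\mathcal{D}^* = \{A, C, G\}$. Let $\mathcal{C}(n;3)$ be the set of all DNA sequences of length $n$ from alphabet $\mathcal{D}^*$ such that for any $x \in \mathcal{C}(n;3)$, every subsequence of length three of $x$ must contain an A.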
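Theorem 3.
Let $\mathcal{C}(n;3)$ denote the code of Construction 2, that is, the set of all sequences of length $n$ over $\{A, C, G\}$ in which every subsequence of length three contains an A. We have $|\mathcal{C}(1;3)| = 3$, $|\mathcal{C}(2;3)| = 9$, $|\mathcal{C}(3;3)| = 19$, and
$$|\mathcal{C}(n;3)| = |\mathcal{C}(n-1;3)| + 2\,|\mathcal{C}(n-2;3)| + 4\,|\mathcal{C}(n-3;3)| \quad \text{for } n \ge 4.$$
In addition, $\mathcal{C}(n;3)$ is an SSA code for all $n$, i.e. every codeword is 3-SSA. The asymptotic rate of this code family is given by $\log \lambda \approx 1.3031$, where $\lambda \approx 2.4675$ is the largest real root of $x^3 - x^2 - 2x - 4 = 0$.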
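Proof.
Consider the code $\mathcal{C}(n;3)$ of Construction 2. For a codeword $x$ and any subsequence $y$ of length $3$ of $x$, $y$ includes A. On the other hand, since T is not used in $x$, there is no reverse-complement of $y$ in $x$. In conclusion, $x$ is 3-SSA, and hence $\mathcal{C}(n;3)$ is an SSA code.
We now establish the cardinality of $\mathcal{C}(n;3)$. It is easy to verify that $|\mathcal{C}(1;3)| = 3$, $|\mathcal{C}(2;3)| = 9$, and $|\mathcal{C}(3;3)| = 19$. For $n \ge 4$, we construct $\mathcal{C}(n;3)$ recursively as follows:
$$\mathcal{C}(n;3) = \mathcal{C}_1 \cup \mathcal{C}_2 \cup \mathcal{C}_3.$$
In other words, $\mathcal{C}_1$ is the set formed by concatenating all sequences in $\mathcal{C}(n-1;3)$ with A, $\mathcal{C}_2$ is the set formed by concatenating all sequences in $\mathcal{C}(n-2;3)$ with AC or AG, and lastly, $\mathcal{C}_3$ is the set formed by concatenating all sequences in $\mathcal{C}(n-3;3)$ with ACC, ACG, AGC, or AGG. It is easy to verify that $\mathcal{C}_1, \mathcal{C}_2, \mathcal{C}_3$ are pairwise disjoint, and the union includes all possible sequences in $\mathcal{C}(n;3)$, since the last A of a codeword lies within its final three symbols. Therefore, we have $|\mathcal{C}(n;3)| = |\mathcal{C}(n-1;3)| + 2\,|\mathcal{C}(n-2;3)| + 4\,|\mathcal{C}(n-3;3)|$. ∎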
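The counting argument can be cross-checked numerically. The sketch below assumes the recurrence $a_n = a_{n-1} + 2a_{n-2} + 4a_{n-3}$ implied by the three-way decomposition (last A in the final, second-to-last, or third-to-last position), compares it with exhaustive enumeration over $\{A, C, G\}$ for short lengths, and locates the largest real root of $x^3 - x^2 - 2x - 4$ by bisection. All function names are illustrative.

```python
import math
from itertools import product


def brute_force_count(n: int) -> int:
    """Count words over {A, C, G} of length n whose every length-3 window contains an A."""
    return sum(
        all("A" in w[i:i + 3] for i in range(n - 2))
        for w in map("".join, product("ACG", repeat=n))
    )


def recurrence_count(n: int) -> int:
    counts = [1, 3, 9]  # lengths 0, 1, 2: no length-3 window exists, so nothing is excluded
    for k in range(3, n + 1):
        counts.append(counts[k - 1] + 2 * counts[k - 2] + 4 * counts[k - 3])
    return counts[n]


def largest_root(lo: float = 2.0, hi: float = 3.0, iterations: int = 80) -> float:
    """Bisection on x^3 - x^2 - 2x - 4, which is negative at 2, positive at 3, and increasing beyond 2."""
    f = lambda x: x ** 3 - x ** 2 - 2 * x - 4
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


for n in range(1, 9):
    assert brute_force_count(n) == recurrence_count(n)
print(recurrence_count(3), recurrence_count(4))  # 19 49
print(math.log2(largest_root()))                 # about 1.30305 bits/nt
```

The computed growth rate agrees with the 1.3031 bits/nt figure quoted for $m = 3$.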
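Construction 2 can be generalized to construct SSA codes with general $m$ as follows.
Theorem 4 (Symbol-Composition Constrained Codes for General $m$, $t = 1$).
Given $n, m > 0$. Set $\mathcal{D}^* = \{A, C, G\}$, and $\mathcal{C}(n;m)$ to be the set of all sequences of length $n$ from alphabet $\mathcal{D}^*$ such that every subsequence of length $m$ includes an A. We then have $|\mathcal{C}(n;m)| = 3^n$ for $n < m$, and
$$|\mathcal{C}(n;m)| = \sum_{i=1}^{m} 2^{i-1}\,|\mathcal{C}(n-i;m)| \quad \text{for } n \ge m,$$
with the convention $|\mathcal{C}(0;m)| = 1$.
We then have $\mathcal{C}(n;m)$ is an SSA code for all $n$. The asymptotic rate of this code family is given by $\log \lambda$, where $\lambda$ is the largest real root of $x^m - \sum_{i=1}^{m} 2^{i-1} x^{m-i} = 0$.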
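For general $m$, the same reasoning (conditioning on the position of the last A within the final $m$ symbols) suggests the recurrence $a_n = \sum_{i=1}^{m} 2^{i-1} a_{n-i}$ with $a_k = 3^k$ for $k < m$. The sketch below assumes that recurrence, checks it against enumeration for a small case, and estimates the asymptotic rate as the base-two logarithm of the unique positive root of $x^m - \sum_{i=1}^{m} 2^{i-1} x^{m-i}$, which lies in $[2, 3]$. Function names are illustrative.

```python
import math
from itertools import product


def count_sequences(n: int, m: int) -> int:
    """Number of words over {A, C, G} of length n whose every length-m window contains an A."""
    counts = [3 ** k for k in range(min(n, m - 1) + 1)]  # short words are unconstrained
    for k in range(m, n + 1):
        counts.append(sum(2 ** (i - 1) * counts[k - i] for i in range(1, m + 1)))
    return counts[n]


def asymptotic_rate(m: int, iterations: int = 80) -> float:
    """log2 of the largest real root of x^m - sum_{i=1}^{m} 2^(i-1) x^(m-i), found by bisection."""
    def f(x: float) -> float:
        return x ** m - sum(2 ** (i - 1) * x ** (m - i) for i in range(1, m + 1))

    lo, hi = 2.0, 3.0  # f(2) <= 0 and f(3) = 2^m > 0 for every m >= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return math.log2((lo + hi) / 2)


# Cross-check the recurrence against enumeration for m = 4.
for n in range(1, 8):
    brute = sum(all("A" in w[i:i + 4] for i in range(n - 3))
                for w in map("".join, product("ACG", repeat=n)))
    assert brute == count_sequences(n, 4)

for m in range(2, 7):
    print(m, round(asymptotic_rate(m), 4))
```

The rate grows with $m$ and stays below $\log 3 \approx 1.585$ bits/nt, the ceiling imposed by using only three of the four bases.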
Remark 2.
In general, given $m$, a nucleotide $\sigma$ with Watson-Crick complement $\overline{\sigma}$, and integers $t' < t$, consider the set of all sequences of length $n$ such that every subsequence of length $m$ contains at least $t$ symbols $\sigma$ and at most $t'$ symbols $\overline{\sigma}$. As shown earlier, this set is an SSA code for all $n$. A natural question is, for a given number $m$, what are the values of $t$ and $t'$, where $t' < t$, such that the code has the largest cardinality? We defer the study of this family, including the code's cardinality and the design of efficient encoding algorithms to map arbitrary DNA sequences into such a code, to future research work.
IV Constructions of SSA Codes for Large $m$ with One Redundant Symbol
In this section, we show that when the stem length $m$ is sufficiently large, there exists an efficient encoding/decoding algorithm for SSA codes with at most one redundant symbol. For simplicity, we assume that $\log_4 n$ is an integer, and define the DNA-representation of an integer as follows.
Definition 4.
For a positive integer , the DNA-representation of is the replacement of symbols in the quaternary representation of over by the following rule:
Example 1.
If , the quaternary representation of length 4 of is , hence, the DNA-representation of is . Similarly, when , the quaternary representation of length 4 of is , thus the DNA-representation of is .
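The idea of a DNA-representation can be sketched as follows. The digit-to-nucleotide rule of Definition 4 is not reproduced in this text, so the mapping 0→A, 1→T, 2→C, 3→G used below is purely a placeholder assumption, exposed as the parameter `digit_to_base` so that the actual rule can be substituted. The function names are ours as well.

```python
def dna_representation(i: int, length: int, digit_to_base: str = "ATCG") -> str:
    """Write i in base 4 with exactly `length` digits, then map each digit to a nucleotide.

    digit_to_base[d] is the nucleotide for quaternary digit d (placeholder mapping by default).
    """
    if not 0 <= i < 4 ** length:
        raise ValueError("i does not fit in the requested number of quaternary digits")
    symbols = []
    for _ in range(length):
        i, digit = divmod(i, 4)      # peel off the least significant digit
        symbols.append(digit_to_base[digit])
    return "".join(reversed(symbols))


def integer_from_dna(word: str, digit_to_base: str = "ATCG") -> int:
    """Inverse map: recover the integer from its DNA-representation."""
    value = 0
    for base in word:
        value = 4 * value + digit_to_base.index(base)
    return value


# 100 = 1*64 + 2*16 + 1*4 + 0, i.e. quaternary 1210.
print(dna_representation(100, 4))   # TCTA under the placeholder mapping
print(integer_from_dna("TCTA"))     # 100
```

Whatever the actual rule is, the property the encoder relies on is only that the map is a bijection between integers and fixed-length DNA words, so that index pointers can be read back by the decoder.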
We now present an explicit construction of the encoder and the corresponding decoder. Our method is based on the sequence replacement technique, which has been widely used in the literature [21, 22, 23]. In addition, we also restrict the length of the repeated patterns of size 2 (also known as the pattern length limited (PLL) constraint, as introduced in [24]).
Construction of . Given , , and . Set . The source DNA sequence . The encoding algorithm includes three phases: prepending phase, scanning and replacing phase, and extending phase.
Prepending phase. The source sequence $x$ of length $n-1$ is prepended with A, to obtain $c = Ax$ of length $n$. If $c$ is an $m$-SSA sequence, then the encoder outputs $c$. Otherwise, it proceeds to the next phase.
Scanning and replacing phase. The encoder searches for the first pair of non-overlapping subsequences of length of , where , such that , or the first subsequence of of the form whose length is , where .
- •
If it finds a pair of non-overlapping subsequences , suppose that , where are subsequences of , and starts at index , ends at index in , where , and starts at index in . We have .
Type-I Replacement. The encoder sets a pointer , starting with symbol , and , where are the DNA-representation of and , respectively. Since are of length , the pointer sequence is of length . It then removes from and prepends to . The replacing step can be illustrated as follows.
[TABLE]
Noted that the removed sequence is of length , while the insertion pointer is of length . Consequently, such a replacement reduces the length of the current sequence by at least one symbol.
- •
On the other hand, suppose that it finds a subsequence of of the form whose length is , where . We further suppose that , where are subsequences of , and starts at index , and ends at index in , where . We have .
Type-II Replacement. Similarly, the encoder sets a pointer , starting with symbol , and , where are the DNA-representation of and , respectively. Since are of length , the pointer sequence is of length . It then removes from and prepends to . The replacing step can be illustrated as follows.
[TABLE]
Noted that the removed sequence is of length , while the insertion pointer is of length . Hence, such a replacement reduces the length of the current sequence by at least symbols. Observe that for .
The encoder repeats the scanning and replacing steps until the current sequence contains no pair of non-overlapping subsequences of length more than or equal to such that one is the reverse-complement of the other, no subsequence of the form whose length is , or the current sequence is of length . Note that each replacement (either Type-I or Type-II) reduces the length of the current sequence by at least one symbol, and hence, this procedure is guaranteed to terminate. Here, we also note that the order of the scanning step is defined according to the starting index of the corresponding subsequences. In case the first subsequence forming a secondary structure, is also the starting of such a subsequence , the encoder proceeds to type-I replacement.
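The stem search performed in the scanning step can be sketched as below. This is only an illustration of the search for the first reverse-complement pair, ordered by the starting index of the left subsequence; it does not model the pointer construction, the repeated-pattern search, or the replacement itself. The function names and the 0-based indexing are our assumptions.

```python
from typing import Optional, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(x: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(x))


def find_first_stem(c: str, m: int) -> Optional[Tuple[int, int]]:
    """Return 0-based starts (i, j) of the first non-overlapping length-m pair
    with c[i:i+m] == RC(c[j:j+m]) and j >= i + m, or None if no such pair exists."""
    n = len(c)
    for i in range(n - 2 * m + 1):            # scan left windows by increasing start index
        target = reverse_complement(c[i:i + m])
        for j in range(i + m, n - m + 1):     # right window must start after the left one ends
            if c[j:j + m] == target:
                return i, j
    return None


print(find_first_stem("ATACCTTTGGTAT", 5))  # (0, 8)
print(find_first_stem("AAAAAA", 3))         # None
```

A return value of `None` for stem length $m$ means the sequence already contains no reverse-complement pair of length $m$ or more, since any longer pair would contain one of length $m$.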
Extending phase. If the length of the current sequence is $n_0$ where $n_0 < n$, the encoder appends a suffix of length $n - n_0$ to obtain a sequence of length $n$. Notably, regardless of the choice of the appended suffix, there is an efficient algorithm to decode the source DNA sequence uniquely (refer to the construction of the decoder). Here, we present one efficient method to generate a suitable suffix so that the output codeword remains $m$-SSA.
- •
If is even, we append to the end of .
- •
If is odd, we append to the end of .
Theorem 5.
The encoder is correct. In other words, the encoder output is an $m$-SSA sequence of length $n$ for every source sequence $x$ of length $n-1$. The redundancy of the encoder is one symbol.
Proof.
Suppose that , and , where is -SSA and the length of the repeated patterns of size 2 in is of length at most , and is the generated suffix of at the extending phase. Consider an arbitrary sequence of length . Suppose that , where is a subsequence of while is a subsequence of . We have the following cases.
- •
If is of length less than (particularly including the case ), hence the length of is more than . Clearly, there is no subsequence in that , as the length of the repeated patterns of size 2 in is of length at most .
- •
If is of length more than or equal to , we also conclude that there is no subsequence in that since is -SSA. ∎
We now present the corresponding decoding algorithm.
Construction of the decoder. From a DNA sequence $c$ of length $n$, the decoder scans from left to right. If the first symbol is A, the decoder simply removes A and identifies the last $n-1$ symbols as the source sequence. On the other hand,
- •
if it starts with , the decoder takes the prefix of length and concludes that this prefix is a pointer prepended after a type-I replacement. In other words, the pointer is of the form , where , each is of length . The decoder sets to be the positive integers whose DNA-representations are , respectively and sets to be the subsequence containing the symbols from index to index . It removes the pointer, adds to at index .
- •
if it starts with , the decoder takes the prefix of length and concludes that this prefix is a pointer prepended after a type-II replacement. In other words, the pointer is of the form , where , each is of length . The decoder sets to be the positive integers whose DNA-representations are , respectively. It then removes the pointer, adds to at index .
The decoding procedure terminates when the first symbol is A, and takes the following $n-1$ symbols as the user data.
Complexity analysis. For codewords of length $n$, the time complexity of the encoder (and the corresponding decoder) is linear in $n$, which follows from two observations: the number of replacing operations is $O(n)$, since each replacement shortens the current sequence by at least one symbol, and the complexity of each replacing operation (including the prepending prefix step or converting the quaternary representation to the DNA-representation of an integer) is constant, $O(1)$.
V Conclusion
We have presented efficient algorithms to construct DNA codes that avoid secondary structure of arbitrary stem length. Particularly, when $m$ is sufficiently large, we have provided an efficient encoder that incurs only one redundant symbol, and when $m = 3$, our constructions yield a family of DNA codes of rate 1.3031 bits/nt, which improves on the previous highest code rate in the literature.
- [1] L. M. Adleman, “Molecular computation of solutions to combinatorial problems,” Science, vol. 266, pp. 1021-1024, Nov. 1994.
- [2] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, p. 1628, 2012.
- [3] Y. Erlich and D. Zielinski, “DNA fountain enables a robust and efficient storage architecture,” Science, vol. 355, no. 6328, pp. 950-954, 2017.
- [4] L. Organick, S. Ang, Y. J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Racz, G. Kamath, P. Gopalan, B. Nguyen, C. Takahashi, S. Newman, H. Y. Parker, C. Rashtchian, K. Stewart, G. Gupta, R. Carlson, J. Mulligan, D. Carmean, G. Seelig, L. Ceze, and K. Strauss, “Random access in large-scale DNA data storage,” Nature Biotechnology, vol. 36, pp. 242-248, 2018.
- [5] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. Le Proust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, vol. 494, pp. 77-80, 2013.
- [6] Y. Benenson, B. Gil, U. Ben-Dor, R. Adar, and E. Shapiro, “An autonomous molecular computer for logical control of gene expression,” Nature, vol. 429, pp. 423-429, May 2004.
- [7] S. M. H. T. Yazdi, R. Gabrys, and O. Milenkovic, “Portable and error-free DNA-based data storage,” Scientific Reports, vol. 7, no. 1, pp. 1-6, 2017.
- [8] O. Milenkovic and N. Kashyap, “On the design of codes for DNA computing,” in Coding Cryptogr., Germany: Springer, Mar. 2006, pp. 100-119.
