On the Design of Codes for DNA Computing: Secondary Structure Avoidance Codes

Tuan Thanh Nguyen, Kui Cai, Han Mao Kiah, Duc Tu Dao, and Kees A. Schouhamer Immink

arXiv:2302.13714·cs.IT·February 28, 2023

TL;DR

This paper presents explicit constructions of DNA codes that completely avoid secondary structures of any stem length, improving code rates and providing efficient encoding methods for DNA computing applications.

Contribution

The work introduces novel explicit constructions for DNA codes that eliminate secondary structures of arbitrary stem length, surpassing previous code rate limits.

Findings

01. Constructed DNA codes with rate 1.3031 bits/nt for m=3.

02. Achieved efficient encoding with only one redundant symbol for m ≥ 3 log n + 4.

03. Provided methods to avoid secondary structures of any stem length ≥ m.

Equations

$${\rm c}_{m}=\lim_{n\to\infty}\frac{\log{\rm A}(n,{\mathcal{D}};m)}{n}.$$

$$|{\mathcal{C}}_{n}|=|{\mathcal{C}}_{n-1}|+2|{\mathcal{C}}_{n-2}|+4|{\mathcal{C}}_{n-3}|.$$

$$S_{n}^{1}=\{{\mathbfsl c}{\bf A}:{\mathbfsl c}\in{\mathcal{C}}_{n-1}\},$$

$$S_{n}^{2}=\{{\mathbfsl c}{\bf A}{\bf C},\ {\mathbfsl c}{\bf A}{\bf G}:{\mathbfsl c}\in{\mathcal{C}}_{n-2}\},$$

$$S_{n}^{3}=\{{\mathbfsl c}{\bf A}{\bf C}{\bf C},\ {\mathbfsl c}{\bf A}{\bf C}{\bf G},\ {\mathbfsl c}{\bf A}{\bf G}{\bf C},\ {\mathbfsl c}{\bf A}{\bf G}{\bf G}:{\mathbfsl c}\in{\mathcal{C}}_{n-3}\},$$

$${\mathcal{C}}_{n}=S_{n}^{1}\cup S_{n}^{2}\cup S_{n}^{3}.$$

$$|{\mathcal{C}}_{n}(m)|=\sum_{j=0}^{m-1}2^{j}|{\mathcal{C}}_{n-j-1}(m)|\quad\text{for }n\geqslant m.$$

$$\mathbfsl{X}_{1}{\mathbfsl y}\mathbfsl{X}_{2}{\mathbfsl{z}}\mathbfsl{X}_{3}\to\mathbfsl{X}_{1}{\mathbfsl y}\mathbfsl{X}_{2}\mathbfsl{X}_{3}\to{\bf T}{\mathbfsl p}_{1}{\mathbfsl p}_{2}{\mathbfsl p}_{3}\mathbfsl{X}_{1}{\mathbfsl y}\mathbfsl{X}_{2}\mathbfsl{X}_{3}$$

$$\mathbfsl{U}_{1}(x_{1}x_{2})^{t}\mathbfsl{U}_{2}\to\mathbfsl{U}_{1}\mathbfsl{U}_{2}\to{\bf C}x_{1}x_{2}{\mathbfsl q}_{1}{\mathbfsl q}_{2}\mathbfsl{U}_{1}\mathbfsl{U}_{2}.$$

Taxonomy

Topics: Advanced biosensing and bioanalysis techniques · DNA and Biological Computing · DNA and Nucleic Acid Chemistry

Full text

On the Design of Codes for DNA Computing: Secondary Structure Avoidance Codes

Tuan Thanh Nguyen$^{1}$, Kui Cai$^{1}$, Han Mao Kiah$^{2}$, Duc Tu Dao$^{2}$, and Kees A. Schouhamer Immink$^{3}$

$^{1}$Science, Mathematics and Technology Cluster, Singapore University of Technology and Design, Singapore 487372

$^{2}$School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371

$^{3}$Turing Machines Inc, Willemskade 15d, 3016 DK Rotterdam, The Netherlands

Emails: {tuanthanh_nguyen, cai_kui}@sutd.edu.sg, {hmkiah,daoductu001}@ntu.edu.sg, [email protected]

Abstract

In this work, we investigate a challenging problem that is considered an important criterion in designing codewords for DNA computing, namely secondary structure avoidance in single-stranded DNA molecules. In short, secondary structure refers to the tendency of a single-stranded DNA sequence to fold back upon itself, thus becoming inactive in the computation process. While some design criteria that reduce the possibility of secondary structure formation have been proposed by Milenkovic and Kashyap (2006), the main contribution of this work is to provide an explicit construction of DNA codes that completely avoid secondary structure of arbitrary stem length.

Formally, given codeword length $n$ and arbitrary integer $m\geqslant 2$ , we provide efficient methods to construct DNA codes of length $n$ that avoid secondary structure of any stem length more than or equal to $m$ . Particularly, when $m=3$ , our constructions yield a family of DNA codes of rate 1.3031 bits/nt, while the highest rate found in the prior art was 1.1609 bits/nt. In addition, for $m\geqslant 3\log n+4$ , we provide an efficient encoder that incurs only one redundant symbol.

I Introduction

DNA computing is an emerging branch of computing that uses DNA, biochemistry, and molecular biology hardware. The field of DNA computation started with the following demonstration by Adleman in 1994 [1]. In this seminal experiment, Adleman solved an instance of the directed traveling salesperson problem by first representing each city with a synthetic DNA molecule. Then by allowing the strands to hybridize in a highly parallel fashion, Adleman obtained the desired solution. Since then, similar methods have been expanded to several attractive applications, including the development of storage technologies [2, 3, 4, 5], and cell-based computation systems for cancer diagnostics and treatment [6]. Recently, the hybridization process was exploited to allow random access in DNA data storage [7].

In DNA computing, only short single-stranded DNA sequences (or oligonucleotide sequences) are used, each of which is an oriented word over four bases (or nucleotides): Adenine (A), Thymine (T), Cytosine (C), and Guanine (G). A set of encoded DNA sequences (also called DNA codewords) that satisfies certain properties (or constraints) required for DNA computing is called a DNA code. A broad description of the constraint problems that arise in coding for DNA computing was given by Milenkovic and Kashyap in 2006 [8]. These include the constant GC-content constraint (which fixes the percentage of nucleotides that are either G or C), the Hamming distance constraint (which requires DNA codewords to be sufficiently different from one another), and the secondary structure avoidance constraint (which prevents a DNA sequence from folding back upon itself and consequently becoming inactive in the computation process). Similar considerations were described in [9, 10] for the design of primer address sequences in random access of DNA-based data storage systems. While the constant GC-content constraint and the Hamming distance constraint have been extensively investigated [11, 12, 13, 8, 14, 15, 16, 17], secondary structure avoidance has received much less attention.

For a DNA sequence, a secondary structure forms when a chemically active sequence folds back onto itself by complementary base-pair hybridization (illustrated in Figure 1). Here, the Watson-Crick complement is defined as: $\overline{{\bf A}}={\bf T},\overline{\bf T}={\bf A},\overline{\bf C}={\bf G}$ , and $\overline{\bf G}={\bf C}$ . For a sequence ${\mathbfsl{x}}=x_{1}x_{2}x_{3}\ldots x_{n-1}x_{n}$ over the DNA alphabet ${\mathcal{D}}=\{{\bf A},{\bf T},{\bf C},{\bf G}\}$ , the reverse-complement of ${\mathbfsl{x}}$ is defined as ${\rm RC}({{\mathbfsl{x}}})={\overline{x_{n}}}\text{ }\overline{x_{n-1}}\ldots\overline{x_{3}}\text{ }\overline{x_{2}}\text{ }\overline{x_{1}}$ . In Figure 1, the sub-sequences ${\mathbfsl{x}}={\bf A}{\bf T}{\bf A}{\bf C}{\bf C}$ and ${\mathbfsl y}={\rm RC}({{\mathbfsl{x}}})={\bf G}{\bf G}{\bf T}{\bf A}{\bf T}$ of the DNA sequence $\sigma$ bind to each other after pairing of A with T and G with C, forming a secondary structure with a loop and a stem of length 5. DNA sequences with secondary structures are less active in the computation process [8], and hence, before such sequences are read in a wet lab, they need to be unfolded, which costs more resources and energy. There exist some simple dynamic programming techniques [18, 19] that can approximately predict the secondary structures in a given DNA sequence (for example, the Nussinov-Jacobson (NJ) algorithm in [19] is one of the most widely used schemes). Based on the NJ algorithm, the authors in [8, 13] found some design criteria that reduce the possibility of secondary structure formation in a codeword. A natural question is whether there exists an efficient design of constrained DNA codes that avoid the formation of secondary structures.

It has been shown experimentally that the number of base pairs in stem regions (or stem length) is one important factor influencing the secondary structure of a DNA sequence. Given codeword length $n$ and an integer $m\geqslant 2$ , we study the problem of constructing DNA codes of length $n$ that avoid secondary structure of any stem length more than or equal to $m$ . To the best of our knowledge, this work is the first attempt aimed at providing a rigorous solution for DNA codes avoiding secondary structure for general stem lengths.

II Preliminary

In this work, we use ${\mathcal{D}}$ to denote the DNA alphabet, where ${\mathcal{D}}=\{{\bf A},{\bf T},{\bf C},{\bf G}\}$ . Here, we have the Watson-Crick complement where $\overline{{\bf A}}={\bf T},\overline{\bf T}={\bf A},\overline{\bf C}={\bf G}$ , and $\overline{\bf G}={\bf C}$ .

Given two sequences ${\mathbfsl{x}}$ and ${\mathbfsl y}$ , we let ${\mathbfsl{x}}{\mathbfsl y}$ denote the concatenation of the two sequences.

Throughout this work, given a sequence ${\mathbfsl{x}}$ of length $n$ , we say ${\mathbfsl y}$ is a subsequence of length $k$ of ${\mathbfsl{x}}$ , where $k\leqslant n$ , if ${\mathbfsl y}=x_{i}x_{i+1}\ldots x_{i+k-1}$ for some $1\leqslant i\leqslant n-k+1$ . In other words, we only consider subsequences that consist of consecutive symbols in ${\mathbfsl{x}}$ . Two subsequences ${\mathbfsl y}$ and ${\mathbfsl{z}}$ of ${\mathbfsl{x}}$ are said to be non-overlapping if we have ${\mathbfsl y}=x_{i}x_{i+1}\ldots x_{i+k-1}$ , ${\mathbfsl{z}}=x_{j}x_{j+1}\ldots x_{j+\ell-1}$ , where $i>j+\ell-1$ or $j>i+k-1$ .

Definition 1.

For a DNA sequence ${\mathbfsl{x}}\in{\mathcal{D}}^{n}$ , ${\mathbfsl{x}}=x_{1}x_{2}\ldots x_{n}$ , the reverse-complement of ${\mathbfsl{x}}$ , is defined as ${\rm RC}({{\mathbfsl{x}}})={\overline{x_{n}}}\text{ }\overline{x_{n-1}}\ldots\overline{x_{3}}\text{ }\overline{x_{2}}\text{ }\overline{x_{1}}$ .
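To make Definition 1 concrete, the following is a minimal Python sketch of the reverse-complement map. The names `COMPLEMENT` and `reverse_complement` are illustrative choices and do not come from the paper.

```python
# Watson-Crick complement map from the text: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(x: str) -> str:
    """Return RC(x): read x from right to left and complement every symbol."""
    return "".join(COMPLEMENT[base] for base in reversed(x))


# The stem from Figure 1: x = ATACC pairs with RC(x) = GGTAT.
print(reverse_complement("ATACC"))  # GGTAT
```

Applying the map twice returns the original sequence, which is why the condition ${\mathbfsl y}={\rm RC}({\mathbfsl{z}})$ is symmetric in ${\mathbfsl y}$ and ${\mathbfsl{z}}$.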

Definition 2.

Given $0<m\leqslant n$ , a DNA sequence ${\mathbfsl{x}}\in{\mathcal{D}}^{n}$ is said to be an $m$ -secondary structure avoidance (or $m$ -SSA) sequence if for all $k\geqslant m$ , there does not exist any pair of non-overlapping subsequences ${\mathbfsl y},{\mathbfsl{z}}$ of length $k$ of ${\mathbfsl{x}}$ such that ${\mathbfsl y}={\rm RC}({{\mathbfsl{z}}})$ . A code ${\mathcal{C}}$ is said to be an $(n,{\mathcal{D}};m)$ SSA code if every codeword ${\mathbfsl{x}}\in{\mathcal{C}}\cap{\mathcal{D}}^{n}$ is $m$ -SSA.

The following result is immediate.

Lemma 1.

Given $m,n>0$ , if a sequence ${\mathbfsl{x}}\in{\mathcal{D}}^{n}$ is $m$ -SSA then ${\mathbfsl{x}}$ is $m^{\prime}$ -SSA for all $m^{\prime}>m$ .

For a code ${\mathcal{C}}\subseteq{\mathcal{D}}^{n}$ , the code rate is measured by the value $\log|{\mathcal{C}}|/n$ . Intuitively, it measures the number of information bits stored in each DNA symbol. Suppose that we have an infinite family of codes $\{{\mathcal{C}}_{n}\}_{n=1}^{\infty}$ , where ${\mathcal{C}}_{n}$ is a code of length $n$ , then the asymptotic rate of the family is ${\bf r}\triangleq\lim_{n\to\infty}\frac{\log|{\mathcal{C}}_{n}|}{n}$ . Here, we adopt the notation $\log$ to mean logarithm base two.

Definition 3.

Given $m>0$ , for $n>0$ , let ${\rm A}(n,{\mathcal{D}};m)$ be the total number of DNA sequences of length $n$ that are $m$ -SSA. The channel capacity, denoted by ${\rm c}_{m}$ , is defined by:

$${\rm c}_{m}=\lim_{n\to\infty}\frac{\log{\rm A}(n,{\mathcal{D}};m)}{n}.$$

The following result is immediate.

Lemma 2.

Given $m>0$ , let $S_{m}$ be the set of all DNA sequences of length $m$ such that there is no pair of sequences ${\mathbfsl y},{\mathbfsl{z}}\in S_{m}$ , not necessarily distinct, such that ${\mathbfsl y}={\rm RC}({\mathbfsl{z}})$ . We then have ${\rm c}_{m}\leqslant\frac{1}{m}\log|S_{m}|$ .

The size of $S_{m}$ can be computed easily for constant $m$ . A trivial upper bound is $|S_{m}|\leqslant 4^{m}/2$ , and consequently, we obtain ${\rm c}_{2}\leqslant 1.5$ and ${\rm c}_{3}\leqslant 1.67$ .
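The two numerical bounds follow from a one-line calculation, spelled out here as a worked step. The closed form $2-1/m$ is not stated in the text and is derived only from the bound $|S_{m}|\leqslant 4^{m}/2$ quoted above.

```latex
% S_m cannot contain both u and RC(u), nor any u with u = RC(u),
% so at most half of the 4^m sequences of length m can belong to S_m.
\begin{align*}
{\rm c}_{m} \leqslant \frac{1}{m}\log|S_{m}|
  \leqslant \frac{1}{m}\log\frac{4^{m}}{2}
  = \frac{2m-1}{m}
  = 2-\frac{1}{m}.
\end{align*}
% m = 2 gives 2 - 1/2 = 1.5, and m = 3 gives 2 - 1/3 \approx 1.67.
```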

To construct an $(n,{\mathcal{D}};m)$ SSA code for arbitrary $m>0$ by concatenation method, one can find the largest set $S_{N}$ for some suitable value of $N$ , such that, for $n=Nk$ , each codeword is a concatenation of $k$ sequences of length $N$ from $S_{N}$ and each concatenation does not create a reverse-complement subsequence from previous concatenations. The construction yields a family of DNA codes of rate $1/N\log|S_{N}|$ bits/nt. For example, for $m=3$ , Krishna Gopal Benerjee and Adrish Banerjee [11] constructed an $(n,{\mathcal{D}};3)$ SSA code via such a set $S=\{{\bf A}{\bf A},{\bf C}{\bf C},{\bf A}{\bf C},{\bf C}{\bf A},{\bf T}{\bf C}\}$ .

Theorem 1 (Benerjee and Banerjee [11]).

Set $S=\{{\bf A}{\bf A},{\bf C}{\bf C},{\bf A}{\bf C},{\bf C}{\bf A},{\bf T}{\bf C}\}$ . Let ${\mathcal{C}}$ be the DNA code of length $2n$ where each codeword is a concatenation of $n$ words of length two from $S$ . We then have ${\mathcal{C}}$ is a $(2n,{\mathcal{D}};3)$ SSA code, i.e. every codeword of ${\mathcal{C}}$ is $3$ -SSA. The size of the code is $|{\mathcal{C}}|=5^{n}$ , and the code rate is $\frac{1}{2}\log 5=1.1609$ bits/nt.
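As a sanity check of Theorem 1 on a small instance, the sketch below enumerates every concatenation of four blocks from $S$ and tests the $3$-SSA property by brute force. The helper names (`block_code`, `is_m_ssa`) are illustrative assumptions, and a check for one short length is not a proof.

```python
from itertools import product
from math import log2

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
S = ["AA", "CC", "AC", "CA", "TC"]  # block set from Theorem 1


def reverse_complement(x):
    return "".join(COMPLEMENT[b] for b in reversed(x))


def is_m_ssa(x, m):
    """Brute-force test: no two disjoint windows of length k >= m are reverse-complements."""
    n = len(x)
    for k in range(m, n // 2 + 1):
        for i in range(n - 2 * k + 1):
            target = reverse_complement(x[i:i + k])
            if any(x[j:j + k] == target for j in range(i + k, n - k + 1)):
                return False
    return True


def block_code(blocks, count):
    """All concatenations of `count` blocks."""
    return ["".join(word) for word in product(blocks, repeat=count)]


code = block_code(S, 4)                         # 5^4 codewords of length 8
print(len(code))                                # 625
print(all(is_m_ssa(c, 3) for c in code))        # True
print(0.5 * log2(5))                            # rate in bits/nt
```

The check succeeds for a simple reason: ${\bf G}$ never appears, so any window containing ${\bf C}$ has no reverse-complement, and ${\bf T}$ is always followed by ${\bf C}$.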

II-A Paper Organisation and Our Main Contribution

Since the number of base pairs in stem regions (or stem length) is one important factor influencing the secondary structure of a DNA sequence, this work aims at providing a rigorous solution for $(n,{\mathcal{D}};m)$ SSA codes given arbitrary $m$ . The paper is organised as follows.

• Section III presents two efficient constructions of $(n,{\mathcal{D}};m)$ SSA codes for arbitrary $m>0$ . The first construction is based on block concatenation, which concatenates blocks of fixed length $m$ from a predetermined set. On the other hand, crucial to the second construction is the concept of symbol-composition constrained codes. Particularly, when $m=3$ , the second construction yields a family of DNA codes of rate $1.3031$ bits/nt, which is higher than the code rate in [11].

• Section IV presents a linear-time encoding method for $(n,{\mathcal{D}};m)$ SSA code with only one redundant symbol whenever $m\geqslant 3\log n+4$ . The coding method is based on sequence replacement technique.

III Constructions of $(n,{\mathcal{D}};m)$ SSA Codes for arbitrary $m>0$

The first method is based on block concatenation, which concatenates blocks of length $m$ from a predetermined set.

III-A Constructions via Block Concatenation

Construction 1.

Given $m>0$ , $n=mk$ for some integer $k>0$ , set $t=\lceil m/3\rceil$ . Let $S_{m}^{*}$ be the set of all DNA sequences of length $m$ such that for any pair of sequences ${\mathbfsl{x}}_{1},{\mathbfsl{x}}_{2}\in S_{m}^{*}$ , not necessarily distinct, there is no pair of subsequences ${\mathbfsl y}$ of ${\mathbfsl{x}}_{1}$ and ${\mathbfsl{z}}$ of ${\mathbfsl{x}}_{2}$ of length $t$ such that ${\mathbfsl y}={\rm RC}({\mathbfsl{z}})$ . Let ${\mathcal{C}}$ be the DNA code of length $n$ , where each codeword is a concatenation of $k$ sequences of length $m$ in $S_{m}^{*}$ .
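One way to obtain a block set satisfying the condition of Construction 1 is a greedy scan, sketched below. This is an assumption-laden illustration rather than the exhaustive search mentioned in the text: it returns a valid set but not necessarily a largest one, and the names `greedy_block_set` and `windows` are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(x):
    return "".join(COMPLEMENT[b] for b in reversed(x))


def windows(x, t):
    """All length-t subsequences (consecutive symbols) of x."""
    return {x[i:i + t] for i in range(len(x) - t + 1)}


def greedy_block_set(m):
    """Greedily build a valid (not necessarily largest) block set for Construction 1."""
    t = (m + 2) // 3          # integer form of ceil(m / 3)
    seen = set()              # length-t windows of all accepted blocks
    blocks = []
    for letters in product("ACGT", repeat=m):
        x = "".join(letters)
        wx = windows(x, t)
        merged = seen | wx
        # Accept x only if no window of x has its reverse-complement among the
        # windows of x itself or of any previously accepted block.
        if all(reverse_complement(u) not in merged for u in wx):
            blocks.append(x)
            seen = merged
    return blocks


blocks = greedy_block_set(3)
print(len(blocks), blocks)    # for m = 3 (t = 1): the 8 blocks over {A, C}
```

For $m=3$ the greedy scan keeps exactly the blocks over $\{{\bf A},{\bf C}\}$, which gives rate $\frac{1}{3}\log 8=1$ bit/nt. A different scan order or a true exhaustive search may produce a larger set.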

Theorem 2.

The constructed code ${\mathcal{C}}$ from Construction 1 is an $(n,{\mathcal{D}};m)$ SSA code.

Proof.

We prove Theorem 2 by contradiction. Suppose that there exists a codeword ${\mathbfsl c}\in{\mathcal{C}},{\mathbfsl c}={\mathbfsl{x}}_{1}{\mathbfsl{x}}_{2}\ldots{\mathbfsl{x}}_{k}$ , where ${\mathbfsl{x}}_{i}\in S_{m}^{*}$ , such that ${\mathbfsl c}$ is not $m$ -SSA. In other words, there exist two non-overlapping subsequences ${\mathbfsl y}$ , ${\mathbfsl{z}}$ of ${\mathbfsl c}$ of length $m^{\prime}\geqslant m$ such that ${\mathbfsl y}={\rm RC}({\mathbfsl{z}})$ .

Suppose that ${\mathbfsl y}=Y_{1}Y_{2}$ where $Y_{1}$ is a subsequence of ${\mathbfsl{x}}_{i}$ , and $Y_{2}$ is a subsequence of ${\mathbfsl{x}}_{i+1}{\mathbfsl{x}}_{i+2}\ldots{\mathbfsl{x}}_{i+h}$ for some $h\geqslant 1$ . We have ${\mathbfsl{z}}={\rm RC}(Y_{2}){\rm RC}(Y_{1})$ . The easy case is when $h>1$ , or $Y_{2}$ is of length more than $m$ : then ${\mathbfsl{x}}_{i+1}$ is a subsequence of $Y_{2}$ and ${\rm RC}({\mathbfsl{x}}_{i+1})$ is a subsequence of ${\mathbfsl{z}}$ . If ${\rm RC}({\mathbfsl{x}}_{i+1})\equiv{\mathbfsl{x}}_{j}$ for some $j$ , we have a contradiction. On the other hand, if ${\rm RC}({\mathbfsl{x}}_{i+1})=W_{1}W_{2}$ where $W_{1}$ is a subsequence of ${\mathbfsl{x}}_{j}$ and $W_{2}$ is a subsequence of ${\mathbfsl{x}}_{j+1}$ for some $j$ , then at least one of $W_{1}$ , $W_{2}$ has length at least $t$ , and we again have a contradiction. We conclude that $h=1$ , i.e. $Y_{2}$ is simply a subsequence of ${\mathbfsl{x}}_{i+1}$ .

Now, since ${\mathbfsl y}=Y_{1}Y_{2}$ is of length $m^{\prime}\geqslant m$ , at least one of $Y_{1}$ , $Y_{2}$ has length at least $t$ . W.l.o.g., assume that $Y_{1}$ has length at least $t$ .

We observe that ${\rm RC}(Y_{1})$ cannot be a subsequence of any ${\mathbfsl{x}}_{j}$ by Construction 1. In other words, ${\rm RC}(Y_{1})=W_{1}W_{2}$ where $W_{1}$ is a subsequence of ${\mathbfsl{x}}_{j}$ and $W_{2}$ is a subsequence of ${\mathbfsl{x}}_{j+1}$ for some $j$ . Similarly, the lengths of $W_{1}$ and $W_{2}$ must be strictly smaller than $t$ . Otherwise, if for example the length of $W_{1}$ is at least $t$ , then the two sequences ${\mathbfsl{x}}_{i}$ and ${\mathbfsl{x}}_{j}$ in $S_{m}^{*}$ contain ${\rm RC}(W_{1})$ and $W_{1}$ as subsequences, which is a contradiction. Since the lengths of $W_{1}$ and $W_{2}$ are both strictly smaller than $t$ , the length of $Y_{1}$ is smaller than $2t$ , and we conclude that the length of $Y_{2}$ is at least $t$ .

Now, let $\mathbfsl{U}={\rm RC}(Y_{2})\cap{\mathbfsl{x}}_{j+1}$ be the subsequence that belongs to both ${\mathbfsl{x}}_{j+1}$ and ${\rm RC}(Y_{2})$ , which is of length at least $t$ . Then $\mathbfsl{U}$ is a subsequence of ${\mathbfsl{x}}_{j+1}$ while ${\rm RC}(\mathbfsl{U})$ is a subsequence of ${\rm RC}({\rm RC}(Y_{2}))=Y_{2}$ , which is a subsequence of ${\mathbfsl{x}}_{i+1}$ . We then have a contradiction.

In conclusion, we have ${\mathcal{C}}$ is an $(n,{\mathcal{D}};m)$ SSA code. We highlight our proof sketch of Theorem 2 in Figure 5. ∎

Remark 1.

Observe that, the set $S_{m}^{*}$ can be constructed via exhaustive search with complexity $O(2^{m})$ . In Section IV, we show that when $m$ is sufficiently large, $m\geqslant 3\log n+4=\Theta(\log n)$ , there exists an efficient encoding/decoding algorithm for $(n,{\mathcal{D}};m)$ SSA codes with at most one redundant symbol. Hence, for the case $m=o(\log n)$ , we can use Construction 1 to construct $(n,{\mathcal{D}};m)$ SSA codes with complexity $2^{m}=\Theta(n)$ .

III-B Constructions via Symbol-Composition Constrained Codes

In this subsection, we present an efficient construction for $(n,{\mathcal{D}};m)$ SSA codes by simply restricting the symbol-composition for every subsequence of length $m$ . Particularly, when $m=3$ , our method yields a family of DNA codes of rate $1.3031$ bits/nt, which is higher than the code rate in [11].

High Level Description. We select a nucleotide $x\in{\mathcal{D}}=\{{\bf A},{\bf T},{\bf C},{\bf G}\}$ , and let $y=\overline{x}\in{\mathcal{D}}$ . For some $0<k\leqslant m$ , we construct an $(n,{\mathcal{D}};m)$ SSA code ${\mathcal{C}}$ as follows: for every codeword ${\mathbfsl c}\in{\mathcal{C}}$ , every subsequence ${\mathbfsl{z}}$ of length $m$ of ${\mathbfsl c}$ contains at least $k$ symbols $x$ and at most $(k-1)$ symbols $y$ . We refer to such a constraint as the symbol-composition constraint. It is easy to verify that such a code ${\mathcal{C}}$ is an $(n,{\mathcal{D}};m)$ SSA code. Indeed, suppose to the contrary that there exists a pair of non-overlapping subsequences ${\mathbfsl{z}}_{1},{\mathbfsl{z}}_{2}$ of length $\ell\geqslant m$ in ${\mathbfsl c}\in{\mathcal{C}}$ , such that ${\mathbfsl{z}}_{2}={\rm RC}({\mathbfsl{z}}_{1})$ . Then there exist two subsequences of length $m$ , namely ${\mathbfsl{z}}_{1}^{\prime}$ of ${\mathbfsl{z}}_{1}$ and ${\mathbfsl{z}}_{2}^{\prime}$ of ${\mathbfsl{z}}_{2}$ , with ${\mathbfsl{z}}_{2}^{\prime}={\rm RC}({\mathbfsl{z}}_{1}^{\prime})$ . Since ${\mathbfsl{z}}_{1}^{\prime}$ contains at least $k$ symbols $x$ , the sequence ${\mathbfsl{z}}_{2}^{\prime}={\rm RC}({\mathbfsl{z}}_{1}^{\prime})$ must contain at least $k$ symbols $y=\overline{x}$ , which contradicts the requirement that every subsequence of length $m$ contains at most $(k-1)$ symbols $y$ .
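The symbol-composition constraint described above is easy to test window by window. The following sketch is illustrative; the function name `satisfies_composition` and its default arguments ($x={\bf A}$, $y={\bf T}$) are assumptions made for the example.

```python
def satisfies_composition(c: str, m: int, k: int, x: str = "A", y: str = "T") -> bool:
    """True if every length-m window of c has at least k copies of x and at most k-1 of y."""
    for i in range(len(c) - m + 1):
        window = c[i:i + m]
        if window.count(x) < k or window.count(y) > k - 1:
            return False
    return True


print(satisfies_composition("ACGACC", 3, 1))  # True: every window of length 3 has an A, no T
print(satisfies_composition("ACCGA", 3, 1))   # False: the window CCG has no A
print(satisfies_composition("ATCA", 3, 1))    # False: the window ATC contains a T
```

A sequence shorter than $m$ has no window of length $m$, so the function returns `True` for it.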

The following construction is for $m=3$ and $k=1$ .

Construction 2 (Symbol-Composition Constrained Codes for $m=3$ , $k=1$ ).

Given $n>0$ , we select $x={\bf A}$ and $y=\overline{x}={\bf T}$ . Set ${\mathcal{D}}^{*}=\{{\bf A},{\bf C},{\bf G}\}$ . Let ${\mathcal{C}}_{n}$ be the set of all DNA sequences of length $n$ from alphabet ${\mathcal{D}}^{*}$ such that for any ${\mathbfsl c}\in{\mathcal{C}}_{n}$ , every subsequence of length three of ${\mathbfsl c}$ must contain an ${\bf A}$ .

Theorem 3.

We have $|{\mathcal{C}}_{1}|=3,|{\mathcal{C}}_{2}|=9,|{\mathcal{C}}_{3}|=19$ , and

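$$|{\mathcal{C}}_{n}|=|{\mathcal{C}}_{n-1}|+2|{\mathcal{C}}_{n-2}|+4|{\mathcal{C}}_{n-3}|\quad\text{for }n\geqslant 4.$$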

In addition, ${\mathcal{C}}_{n}$ is an $(n,{\mathcal{D}};3)$ SSA code for all $n>0$ . The asymptotic rate of this code family is given by $\log(\lambda)\approx 1.3031$ , where $\lambda\approx 2.4675$ is the largest real root of $x^{3}-x^{2}-2x-4=0$ .

Proof.

Consider the code ${\mathcal{C}}_{n}$ . For a codeword ${\mathbfsl c}\in{\mathcal{C}}_{n}$ , for any subsequence ${\mathbfsl{x}}$ of length $\ell\geqslant 3$ of ${\mathbfsl c}$ , we have ${\mathbfsl{x}}$ includes ${\bf A}$ . On the other hand, since $\overline{\bf A}={\bf T}$ is not used in ${\mathbfsl c}$ , there is no reverse-complement of ${\mathbfsl{x}}$ in ${\mathbfsl c}$ . In conclusion, ${\mathbfsl c}$ is 3-SSA, or ${\mathcal{C}}_{n}$ is an $(n,{\mathcal{D}};3)$ SSA code.

We now compute the cardinality of ${\mathcal{C}}_{n}$ . It is easy to verify that $|{\mathcal{C}}_{1}|=3,|{\mathcal{C}}_{2}|=9,|{\mathcal{C}}_{3}|=19.$ For $n\geqslant 4$ , we construct ${\mathcal{C}}_{n}$ recursively as follows:

$$S_{n}^{1}=\{{\mathbfsl c}{\bf A}:{\mathbfsl c}\in{\mathcal{C}}_{n-1}\},$$

$$S_{n}^{2}=\{{\mathbfsl c}{\bf A}{\bf C},\ {\mathbfsl c}{\bf A}{\bf G}:{\mathbfsl c}\in{\mathcal{C}}_{n-2}\},$$

$$S_{n}^{3}=\{{\mathbfsl c}{\bf A}{\bf C}{\bf C},\ {\mathbfsl c}{\bf A}{\bf C}{\bf G},\ {\mathbfsl c}{\bf A}{\bf G}{\bf C},\ {\mathbfsl c}{\bf A}{\bf G}{\bf G}:{\mathbfsl c}\in{\mathcal{C}}_{n-3}\},$$

$${\mathcal{C}}_{n}=S_{n}^{1}\cup S_{n}^{2}\cup S_{n}^{3}.$$

In other words, $S^{1}_{n}$ is the set formed by concatenating all sequences in ${\mathcal{C}}_{n-1}$ with ${\bf A}$ , $S^{2}_{n}$ is the set formed by concatenating all sequences in ${\mathcal{C}}_{n-2}$ with ${\bf A}{\bf C}$ or ${\bf A}{\bf G}$ , and lastly, $S^{3}_{n}$ is the set formed by concatenating all sequences in ${\mathcal{C}}_{n-3}$ with ${\bf A}{\bf C}{\bf C},{\bf A}{\bf C}{\bf G},{\bf A}{\bf G}{\bf C},$ or ${\bf A}{\bf G}{\bf G}$ . It is easy to verify that $S^{i}_{n}\cap S^{j}_{n}=\emptyset$ for $i\neq j$ , and that the union $S^{1}_{n}\cup S^{2}_{n}\cup S^{3}_{n}$ includes all possible sequences in ${\mathcal{C}}_{n}$ . Therefore, we have $|{\mathcal{C}}_{n}|=|{\mathcal{C}}_{n-1}|+2|{\mathcal{C}}_{n-2}|+4|{\mathcal{C}}_{n-3}|.$ ∎
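The counting argument can be cross-checked numerically for small $n$. The sketch below, with illustrative function names, compares a direct enumeration of ${\mathcal{C}}_{n}$ against the three-term recurrence; agreement for $n\leqslant 8$ is a check, not a proof.

```python
from itertools import product


def brute_force_count(n):
    """Count words over {A, C, G} of length n whose every length-3 window contains A."""
    return sum(
        all("A" in w[i:i + 3] for i in range(n - 2))
        for w in map("".join, product("ACG", repeat=n))
    )


def recurrence_count(n):
    """|C_n| from |C_1| = 3, |C_2| = 9, |C_3| = 19 and the three-term recurrence."""
    sizes = {1: 3, 2: 9, 3: 19}
    for i in range(4, n + 1):
        sizes[i] = sizes[i - 1] + 2 * sizes[i - 2] + 4 * sizes[i - 3]
    return sizes[n]


for n in range(1, 9):
    assert brute_force_count(n) == recurrence_count(n)
print([recurrence_count(n) for n in range(1, 5)])  # [3, 9, 19, 49]
```

The recurrence mirrors the case split in the proof: a codeword ends in ${\bf A}$, in ${\bf A}$ followed by one symbol from $\{{\bf C},{\bf G}\}$, or in ${\bf A}$ followed by two such symbols.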

Construction 2 can be generalized to construct $(n,{\mathcal{D}};m)$ SSA codes with $k=1$ as follows.

Theorem 4 (Symbol-Composition Constrained Codes for General $m$ , $k=1$ ).

Given $n,m>0$ , set ${\mathcal{D}}^{*}=\{{\bf A},{\bf C},{\bf G}\}$ , and let ${\mathcal{C}}_{n}(m)$ be the set of all sequences ${\mathbfsl{x}}$ of length $n$ over the alphabet ${\mathcal{D}}^{*}$ such that every subsequence of length $m$ of ${\mathbfsl{x}}$ includes an ${\bf A}$ . We then have $|{\mathcal{C}}_{i}(m)|=3^{i}$ for $0\leqslant i\leqslant m-1$ , and

$$|{\mathcal{C}}_{n}(m)|=\sum_{j=0}^{m-1}2^{j}|{\mathcal{C}}_{n-j-1}(m)|\quad\text{for }n\geqslant m.$$

Moreover, ${\mathcal{C}}_{n}(m)$ is an $(n,{\mathcal{D}};m)$ SSA code for all $n>0$ . The asymptotic rate of this code family is given by $\log(\lambda)$ , where $\lambda$ is the largest real root of $x^{m}-\sum_{j=0}^{m-1}2^{j}x^{m-j-1}=0$ . For $m=3$ , this equation reduces to $x^{3}-x^{2}-2x-4=0$ , in agreement with Theorem 3.
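The rates for general $m$ can be evaluated numerically. The sketch below assumes the characteristic polynomial of the recurrence is $x^{m}-\sum_{j=0}^{m-1}2^{j}x^{m-j-1}$ (which reduces to $x^{3}-x^{2}-2x-4$ for $m=3$) and locates its positive root by bisection on $[1,3]$. The function names are illustrative, and `asymptotic_rate` is intended for $m\geqslant 2$.

```python
from math import log2


def ssa_count(n, m):
    """|C_n(m)|: 3^i for i < m, then the m-term recurrence of Theorem 4."""
    sizes = [3 ** i for i in range(m)]
    for i in range(m, n + 1):
        sizes.append(sum(2 ** j * sizes[i - j - 1] for j in range(m)))
    return sizes[n]


def asymptotic_rate(m, iterations=100):
    """log2 of the positive root of x^m - sum_j 2^j x^(m-j-1), found by bisection."""
    def f(x):
        return x ** m - sum(2 ** j * x ** (m - j - 1) for j in range(m))

    lo, hi = 1.0, 3.0   # f(1) = 2 - 2^m <= 0 and f(3) = 2^m > 0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return log2((lo + hi) / 2)


print(ssa_count(3, 3), ssa_count(4, 3))   # 19 49
print(round(asymptotic_rate(3), 4))       # 1.3031
```

The polynomial has a single sign change in its coefficients, so it has exactly one positive root, and that root is the one the bisection finds.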

Remark 2.

In general, given $m>k>0$ , set $x={\bf A}$ and $y=\overline{x}={\bf T}$ . We use ${\mathcal{C}}_{n}(m,k)$ to denote the set of all sequences ${\mathbfsl c}\in{\mathcal{D}}^{n}$ such that every subsequence ${\mathbfsl{z}}$ of length $m$ of ${\mathbfsl c}$ contains at least $k$ symbols ${\bf A}$ and at most $(k-1)$ symbols ${\bf T}$ . As shown earlier, ${\mathcal{C}}_{n}(m,k)$ is an $(n,{\mathcal{D}};m)$ SSA code for all $m,k$ . A natural question is, for a given number $m>0$ , what is the value of $k$ , where $1\leqslant k\leqslant m$ , such that the code ${\mathcal{C}}_{n}(m,k)$ has the largest cardinality? We defer the study of ${\mathcal{C}}_{n}(m,k)$ , including the code’s cardinality and the design of efficient encoding algorithms to map arbitrary DNA sequences into such a code, to future research work.

IV Constructions of $(n,{\mathcal{D}};m)$ SSA Codes for $m\geqslant 3\log n+4$ with One Redundant Symbol

In this section, we show that when the stem length is sufficiently large, $m\geqslant 3\log n+4=\Theta(\log n)$ , there exists an efficient encoding/decoding algorithm for $(n,{\mathcal{D}};m)$ SSA codes with at most one redundant symbol. For simplicity, we assume that $\log_{4}n$ is an integer, and define the DNA-representation of an integer as follows.

Definition 4.

For a positive integer $N$ , the DNA-representation of $N$ is the replacement of symbols in the quaternary representation of $N$ over $\Sigma_{4}=\{0,1,2,3\}$ by the following rule: $0\leftrightarrow{\bf A},1\leftrightarrow{\bf T},2\leftrightarrow{\bf C},\text{ and }3\leftrightarrow{\bf G}.$

Example 1.

If $N=100$ , the quaternary representation of length 4 of $N$ is $1210$ , hence, the DNA-representation of $N$ is ${\bf T}{\bf C}{\bf T}{\bf A}$ . Similarly, when $N=55$ , the quaternary representation of length 4 of $N$ is $0313$ , thus the DNA-representation of $N$ is ${\bf A}{\bf G}{\bf T}{\bf G}$ .
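Definition 4 and Example 1 translate directly into a few lines of Python. The function name `dna_representation` and the explicit `length` argument (the number of quaternary digits) are illustrative assumptions.

```python
DIGIT_TO_BASE = "ATCG"  # 0 -> A, 1 -> T, 2 -> C, 3 -> G (Definition 4)


def dna_representation(value: int, length: int) -> str:
    """Quaternary expansion of `value`, zero-padded to `length` digits, mapped to bases."""
    if not 0 <= value < 4 ** length:
        raise ValueError("value does not fit in the requested number of symbols")
    digits = []
    for _ in range(length):
        value, digit = divmod(value, 4)     # peel off the least significant digit
        digits.append(DIGIT_TO_BASE[digit])
    return "".join(reversed(digits))        # most significant digit first


print(dna_representation(100, 4))  # TCTA  (100 = 1210 in base 4)
print(dna_representation(55, 4))   # AGTG  (55 = 0313 in base 4)
```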

We now present an explicit construction of the encoder $\textsc{Enc}_{{\rm SSA}}$ and the corresponding decoder $\textsc{Dec}_{\rm SSA}$ . Our method is based on the sequence replacement technique. This method has been widely used in the literature [21, 23, 22]. In addition, we also restrict the length of the repeated patterns of size 2 (also known as pattern length limited (PLL) constraint, as introduced in [24]).

Construction of $\textsc{Enc}_{{\rm SSA}}$ . Let $n>m>0$ with $n>16$ and $m\geqslant 3\log n+4$ , and set $m^{\prime}=1.5\log n+2$ . The source DNA sequence is ${\mathbfsl{x}}\in{\mathcal{D}}^{n-1}$ . The encoding algorithm consists of three phases: a prepending phase, a scanning and replacing phase, and an extending phase.

Prepending phase. The source sequence ${\mathbfsl{x}}\in{\mathcal{D}}^{n-1}$ is prepended with ${\bf A}$ , to obtain ${\mathbfsl c}={\bf A}{\mathbfsl{x}}$ of length $n$ . If ${\mathbfsl c}$ is an $m$ -SSA sequence, then the encoder outputs ${\mathbfsl c}$ . Otherwise, it proceeds to the next phase.

Scanning and replacing phase. The encoder searches for the first pair of non-overlapping subsequences ${\mathbfsl y},{\mathbfsl{z}}$ of length $\ell_{1}$ of ${\mathbfsl c}$ , where $\ell_{1}\geqslant m^{\prime}$ , such that ${\mathbfsl y}={\rm RC}({\mathbfsl{z}})$ , or the first subsequence ${\mathbfsl u}$ of ${\mathbfsl c}$ of the form ${\mathbfsl u}=(x_{1}x_{2})^{t}$ whose length is $\ell_{2}=2t\geqslant m^{\prime}=1.5\log n+2$ , where $x_{1},x_{2}\in{\mathcal{D}}=\{{\bf A},{\bf T},{\bf C},{\bf G}\}$ .

• If it finds a pair of non-overlapping subsequences ${\mathbfsl y},{\mathbfsl{z}}$ , suppose that ${\mathbfsl c}=\mathbfsl{X}_{1}{\mathbfsl y}\mathbfsl{X}_{2}{\mathbfsl{z}}\mathbfsl{X}_{3}$ , where $\mathbfsl{X}_{1},\mathbfsl{X}_{2},\mathbfsl{X}_{3}$ are subsequences of ${\mathbfsl c}$ , and ${\mathbfsl y}$ starts at index $i$ , ends at index $j$ in ${\mathbfsl c}$ , where $j=i+\ell_{1}-1$ , and ${\mathbfsl{z}}$ starts at index $k$ in ${\mathbfsl c}$ . We have $i,j,k\leqslant n-1$ .

Type-I Replacement. The encoder sets a pointer ${\rm P_{I}}$ , starting with symbol ${\bf T}$ , and ${\rm P_{I}}={\bf T}{\mathbfsl p}_{1}{\mathbfsl p}_{2}{\mathbfsl p}_{3}$ , where ${\mathbfsl p}_{1},{\mathbfsl p}_{2},{\mathbfsl p}_{3}$ are the DNA-representation of $i,j,$ and $k$ , respectively. Since ${\mathbfsl p}_{1},{\mathbfsl p}_{2},{\mathbfsl p}_{3}$ are of length $\log_{4}n$ , the pointer sequence ${\rm P}_{\rm I}$ is of length $1+3\log_{4}n=1+1.5\log n$ . It then removes ${\mathbfsl{z}}$ from ${\mathbfsl c}$ and prepends ${\rm P_{I}}$ to ${\mathbfsl c}$ . The replacing step can be illustrated as follows.

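$$\mathbfsl{X}_{1}{\mathbfsl y}\mathbfsl{X}_{2}{\mathbfsl{z}}\mathbfsl{X}_{3}\to\mathbfsl{X}_{1}{\mathbfsl y}\mathbfsl{X}_{2}\mathbfsl{X}_{3}\to{\bf T}{\mathbfsl p}_{1}{\mathbfsl p}_{2}{\mathbfsl p}_{3}\mathbfsl{X}_{1}{\mathbfsl y}\mathbfsl{X}_{2}\mathbfsl{X}_{3}$$

Note that the removed sequence ${\mathbfsl{z}}$ is of length $\ell_{1}\geqslant m^{\prime}=1.5\log n+2$ , while the inserted pointer ${\rm P}_{\rm I}$ is of length $1.5\log n+1$ . Consequently, such a replacement reduces the length of the current sequence by at least one symbol.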

• On the other hand, suppose that it finds a subsequence ${\mathbfsl u}$ of ${\mathbfsl c}$ of the form ${\mathbfsl u}=(x_{1}x_{2})^{t}$ whose length is $\ell_{2}=2t\geqslant m^{\prime}$ , where $x_{1},x_{2}\in{\mathcal{D}}=\{{\bf A},{\bf T},{\bf C},{\bf G}\}$ . We further suppose that ${\mathbfsl c}=\mathbfsl{U}_{1}(x_{1}x_{2})^{t}\mathbfsl{U}_{2}$ , where $\mathbfsl{U}_{1},\mathbfsl{U}_{2}$ are subsequences of ${\mathbfsl c}$ , and ${\mathbfsl u}$ starts at index $i$ , and ends at index $j$ in ${\mathbfsl c}$ , where $j=i+\ell_{2}-1$ . We have $i,j\leqslant n-1$ .

Type-II Replacement. Similarly, the encoder sets a pointer ${\rm P_{II}}$ , starting with symbol ${\bf C}$ , and ${\rm P_{II}}={\bf C}x_{1}x_{2}{\mathbfsl q}_{1}{\mathbfsl q}_{2}$ , where ${\mathbfsl q}_{1},{\mathbfsl q}_{2}$ are the DNA-representation of $i$ and $j$ , respectively. Since ${\mathbfsl q}_{1},{\mathbfsl q}_{2}$ are of length $\log_{4}n$ , the pointer sequence ${\rm P_{II}}$ is of length $1+2+2\log_{4}n=3+\log n$ . It then removes $(x_{1}x_{2})^{\ell_{2}}$ from ${\mathbfsl c}$ and prepends ${\rm P_{II}}$ to ${\mathbfsl c}$ . The replacing step can be illustrated as follows.

[TABLE]

Note that the removed sequence is of length $\ell_{2}\geqslant m^{\prime}=1.5\log n+2$ , while the inserted pointer ${\rm P_{II}}$ is of length $\log n+3$ . Hence, such a replacement reduces the length of the current sequence by at least $(0.5\log n-1)$ symbols. Observe that $0.5\log n-1>1$ for $n>16$ .
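The two replacements can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: indices are taken to be 0-based, ${\mathbfsl{z}}$ is assumed to occur after ${\mathbfsl y}$, the digit-to-symbol map `DNA` (0 to A, 1 to T, 2 to C, 3 to G) is a hypothetical choice standing in for the DNA-representation, and the helper names `to_dna`, `type1_replace`, and `type2_replace` do not come from the paper. The parameter `width` plays the role of $\log_{4}n$.

```python
DNA = "ATCG"  # assumed digit map: 0->A, 1->T, 2->C, 3->G
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_dna(value, width):
    """Write value as exactly `width` base-4 digits using DNA symbols."""
    digits = ""
    for _ in range(width):
        digits = DNA[value % 4] + digits
        value //= 4
    return digits


def reverse_complement(s):
    """Reverse s and swap A<->T, C<->G."""
    return "".join(COMPLEMENT[ch] for ch in reversed(s))


def type1_replace(c, i, j, k, width):
    """Remove z = RC(c[i..j]) that starts at index k > j, then prepend P_I."""
    y = c[i:j + 1]
    z = c[k:k + len(y)]
    if k <= j or z != reverse_complement(y):
        raise ValueError("no copy of RC(y) starts at index k after y")
    pointer = "T" + to_dna(i, width) + to_dna(j, width) + to_dna(k, width)
    return pointer + c[:k] + c[k + len(y):]


def type2_replace(c, i, t, width):
    """Remove u = (x1 x2)^t that starts at index i, then prepend P_II."""
    x = c[i:i + 2]
    j = i + 2 * t - 1  # last index of u
    if c[i:j + 1] != x * t:
        raise ValueError("c[i..j] is not of the form (x1 x2)^t")
    pointer = "C" + x + to_dna(i, width) + to_dna(j, width)
    return pointer + c[:i] + c[j + 1:]


# Toy run for n = 64, so width = log_4(n) = 3 and m' = 1.5 * 6 + 2 = 11.
y = "GACCTTGACCA"                                  # length 11 = m'
c1 = "A" + y + reverse_complement(y) + "A" * 41    # length 64
out1 = type1_replace(c1, 1, 11, 12, 3)             # pointer of length 10
c2 = "AGT" + "AC" * 6 + "T" * 49                   # length 64, run of length 12
out2 = type2_replace(c2, 3, 6, 3)                  # pointer of length 9
print(len(c1) - len(out1), len(c2) - len(out2))
```

In this toy run the Type-I replacement removes 11 symbols and inserts 10, and the Type-II replacement removes 12 symbols and inserts 9, in line with the length bounds stated above.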

The encoder repeats the scanning and replacing steps until the current sequence ${\mathbfsl c}$ contains no pair of non-overlapping subsequences of length at least $m^{\prime}$ such that one is the reverse-complement of the other and no subsequence ${\mathbfsl u}$ of the form ${\mathbfsl u}=(x_{1}x_{2})^{t}$ whose length is $\ell_{2}=2t\geqslant m^{\prime}$ , or until the current sequence is of length $m^{\prime}-1$ . Note that each replacement (either Type-I or Type-II) reduces the length of the current sequence by at least one symbol, and hence, this procedure is guaranteed to terminate. We also note that the scanning order is determined by the starting index of the corresponding subsequences. If the first subsequence ${\mathbfsl y}$ forming a secondary structure is also the start of such a subsequence ${\mathbfsl u}$ , the encoder proceeds with the Type-I replacement.
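The first stopping condition can be tested with a short routine. The sketch below is an assumption-laden illustration rather than the scanning step of the encoder: the function name `has_stem` is hypothetical, and it only reports whether a violating pair exists, not which pair the encoder would pick. It checks windows of length exactly $m$, which suffices because if ${\mathbfsl y}={\rm RC}({\mathbfsl{z}})$ for non-overlapping ${\mathbfsl y},{\mathbfsl{z}}$ of length $\ell\geqslant m$, then the first $m$ symbols of ${\mathbfsl y}$ are the reverse-complement of the last $m$ symbols of ${\mathbfsl{z}}$, and these shorter windows are still non-overlapping.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(s):
    """Reverse s and swap A<->T, C<->G."""
    return "".join(COMPLEMENT[ch] for ch in reversed(s))


def has_stem(c, m):
    """True if c has two non-overlapping length-m windows, one the RC of the other."""
    first_seen = {}
    for a in range(len(c) - m + 1):
        # Remember only the earliest start of each distinct window.
        first_seen.setdefault(c[a:a + m], a)
    for b in range(len(c) - m + 1):
        a = first_seen.get(reverse_complement(c[b:b + m]))
        # The earliest matching window is the one most likely to end before b.
        if a is not None and a + m <= b:
            return True
    return False


print(has_stem("AAATTT", 3), has_stem("AATT", 3))
```

For `"AAATTT"` the windows `AAA` and `TTT` are disjoint and reverse-complementary, whereas in `"AATT"` the only candidate windows `AAT` and `ATT` overlap.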

Extending phase. If the length of the current sequence ${\mathbfsl c}$ is $N_{0}$ where $N_{0}<n$ , the encoder appends a suffix of length $N_{1}=n-N_{0}$ to obtain a sequence of length $n$ . Surprisingly, regardless of the choice of the appended suffix, there is an efficient algorithm to decode the source DNA sequence uniquely (refer to the construction of $\textsc{Dec}_{{\rm SSA}}$ ). Here, we present one efficient method to generate a suitable suffix so that the output codeword remains $m$ -SSA.

•

If $N_{1}$ is even, we append ${\mathbfsl s}=({\bf A}{\bf C})^{N_{1}/2}$ to the end of ${\mathbfsl c}$ .

•

If $N_{1}$ is odd, we append ${\mathbfsl s}=({\bf A}{\bf C})^{(N_{1}-1)/2}{\bf A}$ to the end of ${\mathbfsl c}$ .
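The two suffix rules above can be written as a single expression. The following is a minimal sketch; the function names `extension_suffix` and `extend` are illustrative and not taken from the paper.

```python
def extension_suffix(n1):
    """Return (AC)^(n1/2) for even n1 and (AC)^((n1-1)/2) A for odd n1."""
    return "AC" * (n1 // 2) + "A" * (n1 % 2)


def extend(c, n):
    """Pad the current sequence c to length n with the alternating suffix."""
    if len(c) > n:
        raise ValueError("current sequence is longer than n")
    return c + extension_suffix(n - len(c))


print(extend("TGGT", 9))
```

Here `extend("TGGT", 9)` appends a suffix of odd length $N_{1}=5$, namely `ACACA`.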

Theorem 5.

The encoder $\textsc{Enc}_{{\rm SSA}}$ is correct. In other words, $\textsc{Enc}_{{\rm SSA}}({\mathbfsl{x}})$ is an $m$ -SSA sequence of length $n$ for all ${\mathbfsl{x}}\in{\mathcal{D}}^{n-1}$ . The redundancy of $\textsc{Enc}_{{\rm SSA}}$ is one symbol.

Proof.

Suppose that ${\mathbfsl c}=\textsc{Enc}_{{\rm SSA}}({\mathbfsl{x}})\in{\mathcal{D}}^{n}$ , and write ${\mathbfsl c}={\mathbfsl c}_{1}{\mathbfsl s}$ , where ${\mathbfsl c}_{1}$ is $m^{\prime}$ -SSA, every repeated pattern of size 2 in ${\mathbfsl c}_{1}$ is of length at most $m^{\prime}=1.5\log n+2$ , and ${\mathbfsl s}$ is the suffix appended to ${\mathbfsl c}_{1}$ in the extending phase. Consider an arbitrary subsequence ${\mathbfsl y}$ of ${\mathbfsl c}$ of length $\ell\geqslant 3\log n+4=2m^{\prime}$ . Suppose that ${\mathbfsl y}={\mathbfsl y}_{1}{\mathbfsl y}_{2}$ , where ${\mathbfsl y}_{1}$ is a subsequence of ${\mathbfsl c}_{1}$ while ${\mathbfsl y}_{2}$ is a subsequence of ${\mathbfsl s}$ . We have the following cases.

•

If ${\mathbfsl y}_{1}$ is of length less than $m^{\prime}$ (including, in particular, the case ${\mathbfsl y}_{1}\equiv\varnothing$ ), then the length of ${\mathbfsl y}_{2}$ is more than $m^{\prime}$ . Since ${\mathbfsl y}_{2}$ is a repeated pattern of size 2 of length more than $m^{\prime}$ , whereas every repeated pattern of size 2 in ${\mathbfsl c}_{1}$ is of length at most $m^{\prime}$ , there is no subsequence ${\mathbfsl{z}}$ in ${\mathbfsl c}_{1}{\mathbfsl s}$ such that ${\mathbfsl y}={\rm RC}({\mathbfsl{z}})$ .

•

If ${\mathbfsl y}_{1}$ is of length more than or equal to $m^{\prime}$ , we also conclude that there is no subsequence ${\mathbfsl{z}}$ in ${\mathbfsl c}={\mathbfsl c}_{1}{\mathbfsl s}$ such that ${\mathbfsl y}={\rm RC}({\mathbfsl{z}})$ , since ${\mathbfsl c}_{1}$ is $m^{\prime}$ -SSA. ∎

We now present the corresponding decoding algorithm.

Construction of $\textsc{Dec}_{{\rm SSA}}$ . From a DNA sequence ${\mathbfsl c}$ of length $n$ , the decoder scans from left to right. If the first symbol is ${\bf A}$ , the decoder simply removes ${\bf A}$ and identifies the last $(n-1)$ symbols as the source sequence. On the other hand,

•

if it starts with ${\bf T}$ , the decoder takes the prefix of length $(1+1.5\log n)$ and concludes that this prefix is a pointer prepended after a type-I replacement. In other words, the pointer is of the form ${\bf T}{\mathbfsl p}_{1}{\mathbfsl p}_{2}{\mathbfsl p}_{3}$ , where ${\mathbfsl p}_{1},{\mathbfsl p}_{2},{\mathbfsl p}_{3}$ are each of length $\log_{4}n=0.5\log n$ . The decoder sets $i,j,k$ to be the positive integers whose DNA-representations are ${\mathbfsl p}_{1},{\mathbfsl p}_{2},{\mathbfsl p}_{3}$ , respectively. It removes the pointer, sets ${\mathbfsl y}$ to be the subsequence of the remaining sequence containing the symbols from index $i$ to index $j$ , and adds ${\mathbfsl{z}}\equiv{\rm RC}({\mathbfsl y})$ to ${\mathbfsl c}$ at index $k$ .

•

if it starts with ${\bf C}$ , the decoder takes the prefix of length $(3+\log n)$ and concludes that this prefix is a pointer prepended after a type-II replacement. In other words, the pointer is of the form ${\bf C}x_{1}x_{2}{\mathbfsl q}_{1}{\mathbfsl q}_{2}$ , where ${\mathbfsl q}_{1},{\mathbfsl q}_{2}$ are each of length $\log_{4}n=0.5\log n$ . The decoder sets $i,j$ to be the positive integers whose DNA-representations are ${\mathbfsl q}_{1},{\mathbfsl q}_{2}$ , respectively. It then removes the pointer and adds ${\mathbfsl{z}}\equiv(x_{1}x_{2})^{(j-i+1)/2}$ to ${\mathbfsl c}$ at index $i$ .

The decoding procedure terminates when the first symbol is ${\bf A}$ , and the decoder takes the following $(n-1)$ symbols as the user data.
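The decoding loop described above can be sketched as follows. As before, this is a sketch under assumptions that are not fixed by the text: indices are 0-based and refer to the sequence after the pointer has been removed, the digit map `DNA` (0 to A, 1 to T, 2 to C, 3 to G) is a hypothetical stand-in for the DNA-representation, `width` denotes $\log_{4}n$, and the names `from_dna` and `decode` are illustrative.

```python
DNA = "ATCG"  # assumed digit map: 0->A, 1->T, 2->C, 3->G
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def from_dna(word):
    """Read a DNA word as a base-4 integer, most significant digit first."""
    value = 0
    for ch in word:
        value = 4 * value + DNA.index(ch)
    return value


def reverse_complement(s):
    """Reverse s and swap A<->T, C<->G."""
    return "".join(COMPLEMENT[ch] for ch in reversed(s))


def decode(c, n, width):
    """Undo the prepended pointers one by one until the sequence starts with A."""
    while c[0] != "A":
        if c[0] == "T":    # type-I pointer: T p1 p2 p3
            i, j, k = (from_dna(c[1 + s * width:1 + (s + 1) * width])
                       for s in range(3))
            rest = c[1 + 3 * width:]
            c = rest[:k] + reverse_complement(rest[i:j + 1]) + rest[k:]
        elif c[0] == "C":  # type-II pointer: C x1 x2 q1 q2
            x = c[1:3]
            i = from_dna(c[3:3 + width])
            j = from_dna(c[3 + width:3 + 2 * width])
            rest = c[3 + 2 * width:]
            c = rest[:i] + x * ((j - i + 1) // 2) + rest[i:]
        else:
            raise ValueError("unexpected leading symbol " + c[0])
    # Drop the leading A and any symbols appended in the extending phase.
    return c[1:n]


# Toy codeword for n = 64 (width = 3): one type-II pointer plus suffix ACA.
codeword = "CACAAGAGC" + "AGT" + "T" * 49 + "ACA"
print(decode(codeword, 64, 3) == "GT" + "AC" * 6 + "T" * 49)
```

Because every pointer is prepended, the most recent replacement is always the outermost one, so undoing pointers from the left restores the replacements in reverse order; the suffix from the extending phase sits at the end and is discarded by the final truncation to $(n-1)$ symbols.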

Complexity analysis. For codewords of length $n$ , the time complexity of the encoder (and the corresponding decoder) is linear in $n$ . This follows from two observations: the number of replacing operations is at most $n-m$ , which is $\Theta(n)$ , and each replacing operation (including prepending the pointer and converting the quaternary representation of an integer to its DNA-representation) takes constant time $\Theta(1)$ .

V Conclusion

We have presented efficient algorithms to construct DNA codes that avoid secondary structures of arbitrary stem length. In particular, when $m\geqslant 3\log n+4$ , we have provided an efficient encoder that incurs only one redundant symbol, and when $m=3$ , our constructions yield a family of DNA codes of rate $1.3031$ bits/nt, which improves on the previous highest code rate in the literature.

Bibliography

[1] L. M. Adleman, “Molecular computation of solutions to combinatorial problems,” Science, vol. 266, pp. 1021-1024, Nov. 1994.
[2] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, pp. 1628-1628, 2012.
[3] Y. Erlich and D. Zielinski, “DNA fountain enables a robust and efficient storage architecture,” Science, vol. 355, no. 6328, pp. 950-954, 2017.
[4] L. Organick, S. Ang, Y. J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Racz, G. Kamath, P. Gopalan, B. Nguyen, C. Takahashi, S. Newman, H. Y. Parker, C. Rashtchian, K. Stewart, G. Gupta, R. Carlson, J. Mulligan, D. Carmean, G. Seelig, L. Ceze, and K. Strauss, “Random access in large-scale DNA data storage,” Nature Biotechnology, vol. 36, pp. 242-248, 2018.
[5] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. Le Proust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, vol. 494, pp. 77-80, 2013.
[6] Y. Benenson, B. Gil, U. Ben-Dor, R. Adar and E. Shapiro, “An autonomous molecular computer for logical control of gene expression,” Nature, vol. 429, pp. 423-429, May 2004.
[7] S. M. H. T. Yazdi, S. M., R. Gabrys and O. Milenkovic, “Portable and error-free DNA-based data storage,” Scientific Reports, vol. 7, no. 1, pp. 1-6, 2017.
[8] O. Milenkovic and N. Kashyap, “On the design of codes for DNA computing,” in Coding Cryptogr., Germany: Springer, Mar. 2006, pp. 100-119.
