Some Enumeration Problems in the Duplication-Loss Model of Genome Rearrangement
Mladen Kovačević, Sanja Brdar, Vladimir Crnojević

TL;DR
This paper explores the mathematical properties of tandem-duplication-random-loss (TDRL) genome rearrangements, providing insights into their structure and potential applications in DNA data storage and error correction.
Contribution
It determines the sizes of TDRL 'balls' and their intersections in the permutation space, advancing the understanding of TDRL operations and their mirror variants.
Findings
Cardinality of TDRL balls of radius one is established.
Maximum intersection size of two TDRL balls is calculated.
Results have implications for DNA data storage and error correction.
Abstract
Tandem-duplication-random-loss (TDRL) is an important genome rearrangement operation studied in evolutionary biology. This paper investigates some of the formal properties of TDRL operations on the symmetric group (the space of permutations over an $n$-set). In particular, the cardinality of “balls” of radius one in the TDRL metric, as well as the cardinality of the maximum intersection of two such balls, are determined. The corresponding problems for the so-called mirror (or palindromic) TDRL rearrangement operations are also solved. The results represent an initial step in the study of error correction and reconstruction problems in this context and are of potential interest in DNA-based data storage applications.
BioSense Institute, University of Novi Sad, 21000 Novi Sad, Serbia
Emails: {kmladen, brdars, crnojevic}@uns.ac.rs
I Introduction
The study of genome rearrangements in evolutionary biology is a rich source of mathematical and algorithmic problems that, apart from their relevance for the field they originated in, are also interesting in their own right [12, 19]. In the present paper, we are concerned with the so-called tandem-duplication-random-loss (TDRL) model of genome rearrangement, which is of importance in the study of gene order evolution in mitochondrial genomes [4, 18]. Specifically, we focus on the combinatorial questions of finding the cardinalities of balls and intersections of balls in this context, questions that are important primarily from a coding theoretic viewpoint, and in particular for error correction and reconstruction problems. Our results are of possible interest in DNA-based data storage applications [21]. Namely, in settings where information is being stored in the form of DNA molecules (or pools thereof), the naturally occurring mutations and rearrangement operations represent the “noise”, and methods of dealing with this noise are therefore essential for reliable data recovery.
Combinatorial problems inspired by the TDRL rearrangement model have been studied previously in several works; see, e.g., [3, 7, 8, 10, 14].
Notation and Terminology
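For our purposes, a genome can be modeled as a permutation on the set $\{1,\ldots,n\}$ [10]. The set of all permutations over $\{1,\ldots,n\}$ is denoted by $\mathbb{S}_n$. Each permutation $\pi\in\mathbb{S}_n$ is regarded simply as a sequence $\pi=(\pi_1,\ldots,\pi_n)$, where $\pi_i=\pi(i)$, and thus the elements of $\mathbb{S}_n$ will sometimes be referred to as sequences. The identity permutation is denoted by $\pi^{\textnormal{id}}_n$, or by $\pi^{\textnormal{id}}$ if the length is understood from the context. We say that $(\pi_{i_1},\ldots,\pi_{i_k})$, where $1\le i_1<\cdots<i_k\le n$, is a subsequence of length $k$ of the sequence $\pi$.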
II TDRL Permutations
A TDRL operation on a sequence $\pi\in\mathbb{S}_n$ is a duplication of the entire sequence $\pi$, followed by a deletion of one of the two copies of each of the $n$ symbols. Thus, each TDRL operation is a permutation of the coordinates of $\pi$, and the result is another sequence from $\mathbb{S}_n$.
Example 1.
An example of a TDRL operation on is the following:
[TABLE]
In (1a), the duplicate of the original sequence is overbraced, and the symbols that are not deleted are underlined.
By definition, the symbols that are deleted from the first copy of are not deleted from the second copy, and vice versa. Therefore, a TDRL operation can be specified by a binary pattern indicating the symbols that are not deleted from the first copy of a given sequence, as illustrated in (1b). We will use this binary representation throughout the paper.
Another way to think of a TDRL operation on $\pi$ is as a partition of $\pi$ into two of its subsequences which are then concatenated. For example, in (1), is partitioned into and , and the final result is .
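If a sequence $\sigma$ is the result of applying a TDRL operation on $\pi$, we write $\pi\rightarrow\sigma$, and we define $\boldsymbol{S}^{\rightarrow}(\pi)=\{\sigma:\pi\rightarrow\sigma\}$ and $\boldsymbol{S}^{\leftarrow}(\pi)=\{\sigma:\sigma\rightarrow\pi\}$. The set $\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})$ is illustrated in Table I.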
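The binary-pattern description above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names `tdrl` and `tdrl_ball` are hypothetical, and the convention assumed is the one stated in the text (a bit equal to 1 marks a symbol kept from the first copy, a bit equal to 0 a symbol kept from the second copy).

```python
from itertools import product

def tdrl(pi, pattern):
    """Apply one TDRL operation to the sequence pi.

    pattern[i] == 1: symbol pi[i] is kept from the first copy.
    pattern[i] == 0: symbol pi[i] is kept from the second copy.
    """
    first = [s for s, b in zip(pi, pattern) if b == 1]
    second = [s for s, b in zip(pi, pattern) if b == 0]
    return tuple(first + second)

def tdrl_ball(pi):
    """All sequences obtainable from pi by a single TDRL operation."""
    return {tdrl(pi, p) for p in product((0, 1), repeat=len(pi))}

# Illustrative input (not the example from the paper).
print(tdrl((1, 2, 3, 4, 5), (1, 0, 0, 1, 0)))  # (1, 4, 2, 3, 5)
print(sorted(tdrl_ball((1, 2, 3))))
```

In the first call the sequence is partitioned into the subsequences (1, 4) and (2, 3, 5), which are then concatenated.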
II-A Counting TDRL Operations
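Define
$$S^{\rightarrow}(n):=\big|\boldsymbol{S}^{\rightarrow}(\pi)\big|,\qquad S^{\leftarrow}(n):=\big|\boldsymbol{S}^{\leftarrow}(\pi)\big|,\qquad S^{\leftrightarrow}(n):=\big|\boldsymbol{S}^{\rightarrow}(\pi)\cap\boldsymbol{S}^{\leftarrow}(\pi)\big|. \tag{2}$$
$S^{\leftrightarrow}(n)$ can be thought of as the number of “reversible” TDRL operations, i.e., those TDRL operations that can be inverted by another TDRL operation. We first verify that the quantities $S^{\rightarrow}(n)$, $S^{\leftarrow}(n)$, $S^{\leftrightarrow}(n)$ are well-defined in that they do not depend on $\pi$.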
Lemma 1.
For all $n$ and $\pi,\pi'\in\mathbb{S}_n$, $\big|\boldsymbol{S}^{\rightarrow}(\pi)\big|=\big|\boldsymbol{S}^{\rightarrow}(\pi')\big|$, $\big|\boldsymbol{S}^{\leftarrow}(\pi)\big|=\big|\boldsymbol{S}^{\leftarrow}(\pi')\big|$, $\big|\boldsymbol{S}^{\rightarrow}(\pi)\cap\boldsymbol{S}^{\leftarrow}(\pi)\big|=\big|\boldsymbol{S}^{\rightarrow}(\pi')\cap\boldsymbol{S}^{\leftarrow}(\pi')\big|$.
Proof:
A bijection between, e.g., $\boldsymbol{S}^{\rightarrow}(\pi)$ and $\boldsymbol{S}^{\rightarrow}(\pi')$, is constructed simply by relabeling the symbols in $\{1,\ldots,n\}$ in such a way that $\pi$ is transformed into $\pi'$. More precisely, take $\tau\in\mathbb{S}_n$ such that $\tau(\pi_i)=\pi'_i$ for all $i$, and notice that $\sigma\in\boldsymbol{S}^{\rightarrow}(\pi)$ if and only if $(\tau(\sigma_1),\ldots,\tau(\sigma_n))\in\boldsymbol{S}^{\rightarrow}(\pi')$. ∎
Theorem 2.
$S^{\rightarrow}(n)=S^{\leftarrow}(n)=2^{n}-n$.
Proof:
Since the sequences in $\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})$ are determined by binary patterns of length $n$, the inequality $\big|\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})\big|\leq 2^{n}$ is straightforward. However, notice that the $n+1$ binary patterns of the form $1^{i}0^{n-i}$, $i=0,\ldots,n$, all produce the same sequence, namely $\pi^{\textnormal{id}}$ itself, so we in fact have $\big|\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})\big|\leq 2^{n}-n$. To demonstrate that this upper bound is tight, one would need to show that all other binary patterns produce different sequences. This fact is rather obvious (see Example 1) so we omit a formal proof.
Even though the fact that $S^{\leftarrow}(n)=S^{\rightarrow}(n)$ follows from the same relabeling argument used in the proof of Lemma 1, we give here an alternative derivation that is useful for understanding the structure of reverse TDRL operations. It follows from the definition of TDRL operations that the sequences that can produce $\pi^{\textnormal{id}}$ are those that can be partitioned into subsequences $(1,\ldots,j)$ and $(j+1,\ldots,n)$, for some $j\in\{0,\ldots,n-1\}$. For $j\in\{0,\ldots,n-1\}$, there are exactly $\binom{n}{j}-1$ sequences that can be partitioned into subsequences $(1,\ldots,j)$, $(j+1,\ldots,n)$, but cannot be partitioned into subsequences $(1,\ldots,j')$, $(j'+1,\ldots,n)$ for any $j'\neq j$ (the latter condition is needed to avoid double-counting). Namely, the number of sequences that can be partitioned into subsequences $(1,\ldots,j)$, $(j+1,\ldots,n)$ is the number of ways to choose the positions for the elements of the subsequence $(1,\ldots,j)$, which is $\binom{n}{j}$, and among those sequences there is only one, $\pi^{\textnormal{id}}$, which can also be partitioned into $(1,\ldots,j')$, $(j'+1,\ldots,n)$ with $j'\neq j$. Therefore, $\big|\boldsymbol{S}^{\leftarrow}(\pi^{\textnormal{id}})\big|=1+\sum_{j=0}^{n-1}\big(\binom{n}{j}-1\big)=2^{n}-n$. ∎
We note that the identity $S^{\rightarrow}(n)=2^{n}-n$ also easily follows from [10, Thm 1.1] and [8, Thm 6].
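The count $2^{n}-n$ can be checked by exhaustive enumeration for small $n$. The sketch below is only a sanity check under the binary-pattern convention of Section II (1 = kept from the first copy); the helper names `tdrl_ball` and `tdrl_preimages` are hypothetical and do not come from the paper.

```python
from itertools import permutations, product

def tdrl_ball(pi):
    """Forward ball: all sequences obtainable from pi by one TDRL operation."""
    ball = set()
    for pattern in product((0, 1), repeat=len(pi)):
        first = [s for s, b in zip(pi, pattern) if b]       # kept from copy 1
        second = [s for s, b in zip(pi, pattern) if not b]  # kept from copy 2
        ball.add(tuple(first + second))
    return ball

def tdrl_preimages(pi):
    """Backward ball: all sequences that produce pi in one TDRL operation."""
    pi = tuple(pi)
    return {s for s in permutations(pi) if pi in tdrl_ball(s)}

for n in range(1, 7):
    identity = tuple(range(1, n + 1))
    # Columns: n, forward count, backward count, closed form 2^n - n.
    print(n, len(tdrl_ball(identity)), len(tdrl_preimages(identity)), 2**n - n)
```

The backward ball is computed by brute force over all $n!$ sequences, so this is practical only for small $n$.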
In the following statement we obtain an expression for the number of reversible TDRL operations, or equivalently, for the number of sequences that can both produce $\pi$ and be produced by it.
Theorem 3.
$S^{\leftrightarrow}(n)=\binom{n}{3}+\binom{n}{2}+1=\binom{n+1}{3}+1$.
Proof:
We first argue that a TDRL operation is reversible if and only if the corresponding binary pattern is of the form $1^{i}0^{j}1^{k}0^{l}$, where $i,j,k,l$ are non-negative integers summing to $n$. In words, the requirement is that the pattern has at most two blocks of ones, and if it has exactly two blocks, then one of them is the leading block. For the direct part, notice that a TDRL operation $1^{i}0^{j}1^{k}0^{l}$ is reversible by the TDRL operation $1^{i}0^{k}1^{j}0^{l}$. Conversely, if a TDRL operation is not of the form $1^{i}0^{j}1^{k}0^{l}$, then its binary pattern can be written as $\boldsymbol{u}\,0^{a}1^{b}0^{c}1^{d}\,\boldsymbol{v}$, where $a,b,c,d$ are strictly positive integers and $\boldsymbol{u}$ and $\boldsymbol{v}$ are arbitrary (possibly empty) binary strings. Such a TDRL operation produces a sequence that cannot be partitioned into subsequences $(1,\ldots,j)$, $(j+1,\ldots,n)$, and is therefore not reversible.
Now that we have a characterization of reversible TDRL operations, we can use it to show the desired expression. There is one binary pattern containing no $1$’s, and there are $\binom{n+1}{2}$ binary patterns containing exactly one block of $1$’s (a block is determined by its delimiters). Among the latter, there are $n$ patterns for which this block is the leading block, i.e., patterns of the form $1^{i}0^{n-i}$, $i=1,\ldots,n$. As we already know, such patterns correspond to the same TDRL operation as the pattern $0^{n}$, while all the other patterns correspond to different TDRL operations. Therefore, there are exactly $1+\binom{n+1}{2}-n=\binom{n}{2}+1$ different TDRL operations corresponding to binary patterns with at most one block of $1$’s. Finally, there are $\binom{n}{3}$ binary patterns with exactly two blocks of $1$’s, one of which is the leading block (choose the length of the leading block and then choose the delimiters of the second block), and all of them correspond to different TDRL operations. ∎
Thus, only an asymptotically vanishing fraction of TDRL operations are reversible: $S^{\leftrightarrow}(n)/S^{\rightarrow}(n)\to 0$ as $n\to\infty$.
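The counting argument in the proof of Theorem 3 adds up to $1+\binom{n}{2}+\binom{n}{3}=\binom{n+1}{3}+1$ reversible operations. A brute-force comparison for small $n$ might look as follows; this is a sketch with hypothetical helper names (`tdrl_ball`, `reversible_count`), assuming the pattern convention of Section II.

```python
from itertools import product
from math import comb

def tdrl_ball(pi):
    """All sequences obtainable from pi by one TDRL operation."""
    ball = set()
    for pattern in product((0, 1), repeat=len(pi)):
        first = [s for s, b in zip(pi, pattern) if b]
        second = [s for s, b in zip(pi, pattern) if not b]
        ball.add(tuple(first + second))
    return ball

def reversible_count(n):
    """Number of sequences that the identity produces and that produce it back."""
    identity = tuple(range(1, n + 1))
    return sum(1 for s in tdrl_ball(identity) if identity in tdrl_ball(s))

for n in range(1, 8):
    # Columns: n, brute-force count, closed form, size of the whole ball.
    print(n, reversible_count(n), comb(n + 1, 3) + 1, 2**n - n)
```

The last column grows exponentially while the reversible count grows only cubically, which is the vanishing-fraction observation made above.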
II-B The Reconstruction Problem
The sequence reconstruction problem, as introduced by Levenshtein [16], is defined as follows: a sequence is transmitted through a noisy channel multiple times, and the receiver is required to reconstruct it after it has collected sufficiently many noisy observations. The question is how many different noisy versions of the sequence are sufficient in order to guarantee successful and unambiguous reconstruction. In combinatorial terms the problem can be rephrased as follows: what is the cardinality of the largest possible intersection of sets of channel outputs that two different sequences of length $n$ can produce? Denoting the cardinality of the mentioned largest intersection by $N(n)$, one easily concludes that the number of noisy observations that guarantees successful reconstruction in all cases is $N(n)+1$. The problem of determining the largest intersection of two “balls” in a given space is therefore relevant in all situations where one uses a simple repetition scheme to communicate reliably. As argued in [22], this problem naturally arises in DNA-based data storage applications.
In the present context, the “noise” consists of the TDRL rearrangement operations, and the reconstruction problem reduces to the following: what is the largest possible cardinality of the set $\boldsymbol{S}^{\rightarrow}(\pi)\cap\boldsymbol{S}^{\rightarrow}(\pi')$ for two different sequences $\pi,\pi'\in\mathbb{S}_n$? So define
$$N(n):=\max_{\substack{\pi,\pi'\in\mathbb{S}_n\\ \pi\neq\pi'}}\big|\boldsymbol{S}^{\rightarrow}(\pi)\cap\boldsymbol{S}^{\rightarrow}(\pi')\big|.$$
In the following statement we give a solution to the reconstruction problem just described. For other relevant works on the reconstruction problem for translocation/permutation errors, see, e.g., [15, 17, 20].
Theorem 4.
$N(n)=2^{n-1}$.
Proof:
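Consider the sequence $\pi=(2,3,\ldots,n,1)$ obtained from $\pi^{\textnormal{id}}$ by moving the first symbol to the last position (a cyclic shift). Consider some $\sigma\in\boldsymbol{S}^{\rightarrow}(\pi)$, and suppose that the binary pattern corresponding to the TDRL operation $\pi\rightarrow\sigma$ ends in a $1$, i.e., is of the form $\boldsymbol{b}1$ for $\boldsymbol{b}\in\{0,1\}^{n-1}$. Then it is easy to see that $\sigma$ can also be obtained from $\pi^{\textnormal{id}}$ via the TDRL operation $0\boldsymbol{b}$, and hence $\sigma\in\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})\cap\boldsymbol{S}^{\rightarrow}(\pi)$. Since there are $2^{n-1}$ binary strings of the form $\boldsymbol{b}1$, and since all of them result in different sequences $\sigma$, we have just shown that $\big|\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})\cap\boldsymbol{S}^{\rightarrow}(\pi)\big|\geq 2^{n-1}$, and therefore $N(n)\geq 2^{n-1}$.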
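We now use induction to prove that $\big|\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})\cap\boldsymbol{S}^{\rightarrow}(\pi)\big|\leq 2^{n-1}$ for every $n$ and every $\pi\in\mathbb{S}_n\setminus\{\pi^{\textnormal{id}}\}$. Suppose that, for a given $n$, there is a sequence $\pi\neq\pi^{\textnormal{id}}$ such that $\big|\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})\cap\boldsymbol{S}^{\rightarrow}(\pi)\big|>2^{n-1}$. This implies that there are at least $2^{n-1}+1$ binary patterns describing TDRL operations $\pi^{\textnormal{id}}\rightarrow\sigma$ such that $\sigma\in\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})\cap\boldsymbol{S}^{\rightarrow}(\pi)$. For $i\in\{1,\ldots,n\}$, denote by $\pi^{(i)}$ and $\pi^{\textnormal{id},(i)}$ the sequences obtained by deleting the symbol $i$ from $\pi$ and $\pi^{\textnormal{id}}$, respectively, and suppose that $\pi^{(i)}\neq\pi^{\textnormal{id},(i)}$ (if not, choose another index $i$ for which this holds). (By possibly renaming the symbols, both $\pi^{(i)}$ and $\pi^{\textnormal{id},(i)}$ can be thought of as sequences/permutations over $\{1,\ldots,n-1\}$, in which case $\pi^{\textnormal{id},(i)}$ would be the identity permutation.) By deleting the $i$’th bit of each of the mentioned binary patterns, one would get at least $2^{n-2}+1$ different binary patterns of length $n-1$. Notice that these binary patterns describe TDRL operations on the sequence $\pi^{\textnormal{id},(i)}$, and that every sequence that is the result of such an operation can be produced by $\pi^{(i)}$ as well, i.e., it belongs to $\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id},(i)})\cap\boldsymbol{S}^{\rightarrow}(\pi^{(i)})$. (If a binary pattern $\boldsymbol{b}$ describes a TDRL operation that produces a sequence in the intersection $\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})\cap\boldsymbol{S}^{\rightarrow}(\pi)$, then it is not difficult to see that the pattern obtained from $\boldsymbol{b}$ by deleting its $i$’th bit describes a TDRL operation that produces a sequence in the intersection $\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id},(i)})\cap\boldsymbol{S}^{\rightarrow}(\pi^{(i)})$.) We have thus shown that the assumption $\big|\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id}})\cap\boldsymbol{S}^{\rightarrow}(\pi)\big|>2^{n-1}$ implies that $\big|\boldsymbol{S}^{\rightarrow}(\pi^{\textnormal{id},(i)})\cap\boldsymbol{S}^{\rightarrow}(\pi^{(i)})\big|>2^{n-2}$. In other words, assuming $N(n)>2^{n-1}$ implies $N(n-1)>2^{n-2}$, and since one can directly verify that $N(2)=2$, the inductive proof that $N(n)\leq 2^{n-1}$ for every $n$ is complete. ∎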
As exemplified in the previous proof, the intersection $\boldsymbol{S}^{\rightarrow}(\pi)\cap\boldsymbol{S}^{\rightarrow}(\pi')$ is of maximum possible cardinality when $\pi,\pi'$ are cyclic shifts (by one position) of one another. This is also the case for any two sequences that differ by one adjacent transposition, e.g., $(1,2,3,\ldots,n)$, $(2,1,3,\ldots,n)$.
Corollary 5.
Let $n\geq 3$. Every sequence $\pi\in\mathbb{S}_n$ is uniquely determined by any $2^{n-1}+1$ elements of $\boldsymbol{S}^{\rightarrow}(\pi)$.
Proof:
We just have to verify that $\big|\boldsymbol{S}^{\rightarrow}(\pi)\big|=2^{n}-n\geq 2^{n-1}+1=N(n)+1$ for $n\geq 3$. ∎
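For small $n$, both the value $N(n)=2^{n-1}$ and the reconstruction guarantee of Corollary 5 can be examined by exhaustive search. The sketch below is illustrative only: `max_intersection` and `reconstruct` are hypothetical names, and fixing one of the two sequences to be the identity is an assumption justified by the relabeling argument of Lemma 1.

```python
from itertools import permutations, product

def tdrl_ball(pi):
    """All sequences obtainable from pi by one TDRL operation."""
    ball = set()
    for pattern in product((0, 1), repeat=len(pi)):
        first = [s for s, b in zip(pi, pattern) if b]
        second = [s for s, b in zip(pi, pattern) if not b]
        ball.add(tuple(first + second))
    return ball

def max_intersection(n):
    """Largest overlap between the ball of the identity and any other ball."""
    identity = tuple(range(1, n + 1))
    base = tdrl_ball(identity)
    return max(len(base & tdrl_ball(p))
               for p in permutations(identity) if p != identity)

def reconstruct(observations, n):
    """All candidate sequences whose ball contains every observation."""
    obs = set(observations)
    return [p for p in permutations(range(1, n + 1)) if obs <= tdrl_ball(p)]

for n in range(2, 6):
    print(n, max_intersection(n), 2**(n - 1))

# Reconstruction from N(4) + 1 = 9 distinct outputs of an arbitrary sequence.
secret = (3, 1, 4, 2)
observed = sorted(tdrl_ball(secret))[:9]
print(reconstruct(observed, 4))
```

With nine distinct observations the candidate list contains a single sequence, in line with Corollary 5; with fewer observations it may contain several.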
II-C Bounded TDRL Permutations
In this subsection we analyze a more general model where a TDRL rearrangement operation is confined to segments of width $k$ within the original sequence [8]. In other words, a TDRL operation is in this case applied on a segment of $k$ consecutive symbols of a given sequence $\pi\in\mathbb{S}_n$, while the remaining $n-k$ symbols of $\pi$ are left intact.
Example 2.
One possible TDRL operation on , applied on the segment of length , is the following:
[TABLE]
where the duplicate segment is overbraced, and the symbols that are not deleted (from the original segment and its duplicate) are underlined.
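In the special case $k=2$, the only non-trivial TDRL operations are adjacent transpositions, i.e., swaps of two adjacent symbols.
Let $S^{\rightarrow}(n;k)$ be the number of sequences that can be obtained from $\pi\in\mathbb{S}_n$ by applying a TDRL operation on an arbitrary segment of $\pi$ consisting of $k$ consecutive symbols, and define $S^{\leftarrow}(n;k)$ and $S^{\leftrightarrow}(n;k)$ accordingly (see (2)). The same argument that was used in the proof of Lemma 1 can be used in this context as well, implying that $S^{\rightarrow}(n;k)=S^{\leftarrow}(n;k)$.
Theorem 6.
$S^{\rightarrow}(n;k)=S^{\leftarrow}(n;k)=(n-k)\big(2^{k-1}-1\big)+2^{k}-k$.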
Proof:
Consider first the sequences that can be produced from $\pi^{\textnormal{id}}$ by applying a TDRL operation on its first $k$ symbols. We know by Theorem 2 that the number of such sequences is $2^{k}-k$. Now consider the second “window” of length $k$ containing the symbols $2,\ldots,k+1$. There are again $2^{k}-k$ different sequences we can get by applying a TDRL operation on this window; however, some of them are identical to sequences that were obtained in the first step. Namely, all sequences that can be produced by a TDRL operation on the intersection of the two windows, i.e., on the symbols $2,\ldots,k$, are double-counted in this way,
[TABLE]
The number of sequences that have been double-counted (those that can be produced by a TDRL operation on the segment $(2,\ldots,k)$) is $2^{k-1}-(k-1)$. We then proceed to find $S^{\rightarrow}(n;k)$ as follows: count the sequences that can be produced by a TDRL operation on $(1,\ldots,k)$ but cannot be produced by a TDRL operation on $(2,\ldots,k)$ (the latter will be counted in the second window); then add the number of sequences that can be produced by a TDRL operation on $(2,\ldots,k+1)$ but cannot be produced by a TDRL operation on $(3,\ldots,k+1)$; etc. This is done for the first $n-k$ windows. For the last, $(n-k+1)$’th window there is no need to exclude any sequences because the procedure stops and there is no double-counting. We thus get $S^{\rightarrow}(n;k)=(n-k)\big(2^{k}-k-(2^{k-1}-k+1)\big)+2^{k}-k=(n-k)\big(2^{k-1}-1\big)+2^{k}-k$, which is what we needed to show. ∎
As an application of Theorem 6, we next state a sphere-packing bound for codes in $\mathbb{S}_n$ correcting one “TDRL error” of length $k$. Namely, let $\mathcal{C}\subseteq\mathbb{S}_n$ be a set of sequences with the property that every sequence from $\mathcal{C}$ can be uniquely recovered even after a TDRL operation of length $k$ has been applied on it. Then, by Theorem 6 and a simple sphere-packing argument, we conclude that the cardinality of any such code is upper-bounded as:
$$|\mathcal{C}|\leq\frac{n!}{S^{\rightarrow}(n;k)}=\frac{n!}{(n-k)\big(2^{k-1}-1\big)+2^{k}-k}.$$
For $k=2$ we have $S^{\rightarrow}(n;2)=n$, and the above sphere-packing bound reduces to $|\mathcal{C}|\leq(n-1)!$. We note that error-correcting codes in $\mathbb{S}_n$ with respect to various error/rearrangement models have been extensively studied in the literature; see, e.g., [1, 9, 11, 13] and the references therein.
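The sliding-window count in the proof of Theorem 6 works out to $(n-k)(2^{k-1}-1)+2^{k}-k$ sequences, and the sphere-packing bound divides $n!$ by that number. The following sketch compares the closed form with direct enumeration for small parameters; the function names are hypothetical and the window is assumed to be a run of $k$ consecutive positions, as described in the text.

```python
from itertools import product
from math import factorial

def bounded_tdrl_ball(pi, k):
    """Sequences obtainable from pi by one TDRL operation on a width-k window."""
    pi = tuple(pi)
    ball = set()
    for start in range(len(pi) - k + 1):
        window = pi[start:start + k]
        for pattern in product((0, 1), repeat=k):
            first = [s for s, b in zip(window, pattern) if b]
            second = [s for s, b in zip(window, pattern) if not b]
            # Symbols outside the window stay where they are.
            ball.add(pi[:start] + tuple(first + second) + pi[start + k:])
    return ball

def ball_size_formula(n, k):
    """Closed form obtained from the sliding-window argument."""
    return (n - k) * (2**(k - 1) - 1) + 2**k - k

def sphere_packing_bound(n, k):
    """Largest integer not exceeding n! divided by the ball size."""
    return factorial(n) // ball_size_formula(n, k)

for n in range(2, 7):
    for k in range(2, n + 1):
        size = len(bounded_tdrl_ball(range(1, n + 1), k))
        print(n, k, size, ball_size_formula(n, k), sphere_packing_bound(n, k))
```

For $k=2$ the ball contains the sequence itself and its $n-1$ adjacent transpositions, so the bound becomes $(n-1)!$.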
Theorem 7.
$S^{\leftrightarrow}(n;k)=(n-k)\binom{k}{2}+\binom{k+1}{3}+1$.
Proof:
The statement follows from Theorem 3 and the inclusion-exclusion method of counting that was used in the proof of Theorem 6 as well. ∎
Note that Theorems 2, 3 are recovered from Theorems 6, 7 for $k=n$.
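Combining the reversible count $\binom{k+1}{3}+1$ for a single window with the sliding-window argument of Theorem 6 gives the candidate closed form $(n-k)\binom{k}{2}+\binom{k+1}{3}+1$ for the number of reversible bounded TDRL operations. That expression is a reconstruction from the two proofs rather than a quotation, so the sketch below checks it against brute force; all function names are hypothetical.

```python
from itertools import product
from math import comb

def bounded_tdrl_ball(pi, k):
    """Sequences obtainable from pi by one TDRL operation on a width-k window."""
    pi = tuple(pi)
    ball = set()
    for start in range(len(pi) - k + 1):
        window = pi[start:start + k]
        for pattern in product((0, 1), repeat=k):
            first = [s for s, b in zip(window, pattern) if b]
            second = [s for s, b in zip(window, pattern) if not b]
            ball.add(pi[:start] + tuple(first + second) + pi[start + k:])
    return ball

def bounded_reversible_count(n, k):
    """Sequences the identity produces that also produce the identity back."""
    identity = tuple(range(1, n + 1))
    return sum(1 for s in bounded_tdrl_ball(identity, k)
               if identity in bounded_tdrl_ball(s, k))

def reversible_formula(n, k):
    """Candidate closed form (assumed, see the lead-in)."""
    return (n - k) * comb(k, 2) + comb(k + 1, 3) + 1

for n in range(2, 7):
    for k in range(2, n + 1):
        print(n, k, bounded_reversible_count(n, k), reversible_formula(n, k))
```

Setting $k=n$ removes the first term and leaves the unbounded count from Theorem 3.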
III Mirror-TDRL Permutations
A mirror (or palindromic) TDRL operation (MTDRL operation for short) on a sequence $\pi\in\mathbb{S}_n$ is a duplication of the sequence $\pi$, followed by a reversal of the second copy, and by a deletion of one of the two copies of each of the individual symbols [2].
Example 3.
An example of a MTDRL operation on is the following:
[TABLE]
where the reversed copy of the original sequence is overbraced, and the symbols that are not deleted are underlined.
The set of sequences resulting from applying a MTDRL operation on $\pi^{\textnormal{id}}$ is illustrated in Table II.
III-A Counting MTDRL Operations
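The quantities $S_{\textsc{m}}^{\rightarrow}(n)$, $S_{\textsc{m}}^{\leftarrow}(n)$, $S_{\textsc{m}}^{\leftrightarrow}(n)$ in this setting are defined similarly to (2). The fact that $S_{\textsc{m}}^{\rightarrow}(n)=S_{\textsc{m}}^{\leftarrow}(n)$ is established by the same reasoning as in Lemma 1.
Theorem 8.
$S_{\textsc{m}}^{\rightarrow}(n)=S_{\textsc{m}}^{\leftarrow}(n)=2^{n-1}$.
Proof:
The binary patterns $\boldsymbol{b}0$ and $\boldsymbol{b}1$, for $\boldsymbol{b}\in\{0,1\}^{n-1}$, always produce the same sequence. This follows from the definition of MTDRL operations (6) (see also Table II). Furthermore, all patterns $\boldsymbol{b}0$, $\boldsymbol{b}\in\{0,1\}^{n-1}$, produce different sequences. Hence, $S_{\textsc{m}}^{\rightarrow}(n)=2^{n-1}$. ∎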
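The mirror operation differs from the plain TDRL operation only in that the symbols kept from the second copy are read in reverse order. A minimal Python sketch, with hypothetical names `mtdrl` and `mtdrl_ball` and the same bit convention as before (1 = kept from the first copy):

```python
from itertools import product

def mtdrl(pi, pattern):
    """Apply one mirror-TDRL operation to pi.

    Symbols flagged 1 are kept from the first copy (original order);
    symbols flagged 0 are kept from the reversed second copy.
    """
    first = [s for s, b in zip(pi, pattern) if b == 1]
    second = [s for s, b in zip(pi, pattern) if b == 0]
    return tuple(first + second[::-1])

def mtdrl_ball(pi):
    """All sequences obtainable from pi by a single MTDRL operation."""
    return {mtdrl(pi, p) for p in product((0, 1), repeat=len(pi))}

# Flipping the last bit of a pattern does not change the result.
print(mtdrl((1, 2, 3, 4, 5), (1, 0, 0, 1, 0)))  # (1, 4, 5, 3, 2)
print(mtdrl((1, 2, 3, 4, 5), (1, 0, 0, 1, 1)))  # (1, 4, 5, 3, 2)

for n in range(1, 9):
    print(n, len(mtdrl_ball(tuple(range(1, n + 1)))), 2**(n - 1))
```

The two calls illustrate why the last bit is redundant: the last symbol of the sequence sits at the junction of the two parts whichever copy it is taken from.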
We note that Theorem 8 can also be inferred from the characterization of the set of sequences obtained in [2, Lem. 2 and Cor. 1]. The following statement gives the number of reversible MTDRL operations.
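Theorem 9.
$S_{\textsc{m}}^{\leftrightarrow}(n)=n$.
Proof:
We need to count all sequences that can both produce $\pi^{\textnormal{id}}$ and be produced by it in a single MTDRL operation. First notice that all sequences in $\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi^{\textnormal{id}})$ are unimodal (first increasing, then decreasing). This follows from the definition of MTDRL operations: each such operation can be seen as selecting a (necessarily increasing) subsequence of $\pi^{\textnormal{id}}$ in the first step, and then reading off the remaining subsequence in reverse order. Now, if a sequence $\sigma\in\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi^{\textnormal{id}})$ ends with $1$, it is possible to produce $\pi^{\textnormal{id}}$ from it only via the pattern $0^{n}$ (because $\pi^{\textnormal{id}}$ starts with $1$), which implies that $\sigma=(n,n-1,\ldots,1)$. If a sequence $\sigma\in\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi^{\textnormal{id}})$ ends with $2$, then it has to start with $1$ because it is unimodal, as we have noted above. It is possible to produce $\pi^{\textnormal{id}}$ from such a sequence only via the pattern $10^{n-1}$ (because $\pi^{\textnormal{id}}$ starts with $1$), which implies that $\sigma=(1,n,n-1,\ldots,2)$. Continuing in this way, one concludes that there is exactly one sequence in $\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi^{\textnormal{id}})$ that ends with $i$, $i=1,\ldots,n$, and that can produce $\pi^{\textnormal{id}}$. Therefore, the number of reversible MTDRL operations is $n$. ∎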
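The argument above suggests that the reversible sequences are exactly those consisting of an increasing prefix $1,\ldots,i-1$ followed by the decreasing run $n,n-1,\ldots,i$, one for each $i$. The sketch below compares this description with brute force for small $n$; the explicit form of the sequences is an extrapolation of the two cases worked out in the proof, and the function names are hypothetical.

```python
from itertools import product

def mtdrl_ball(pi):
    """All sequences obtainable from pi by one mirror-TDRL operation."""
    ball = set()
    for pattern in product((0, 1), repeat=len(pi)):
        first = [s for s, b in zip(pi, pattern) if b]
        second = [s for s, b in zip(pi, pattern) if not b]
        ball.add(tuple(first + second[::-1]))  # second copy is reversed
    return ball

def reversible_mtdrl(n):
    """Sequences produced by the identity that also produce the identity."""
    identity = tuple(range(1, n + 1))
    return {s for s in mtdrl_ball(identity) if identity in mtdrl_ball(s)}

def predicted(n):
    """(1, ..., i-1, n, n-1, ..., i) for i = 1, ..., n."""
    return {tuple(range(1, i)) + tuple(range(n, i - 1, -1))
            for i in range(1, n + 1)}

for n in range(1, 9):
    print(n, len(reversible_mtdrl(n)), reversible_mtdrl(n) == predicted(n))
```

Each predicted sequence is undone by keeping its increasing prefix from the first copy and reading the decreasing run from the reversed copy.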
III-B The Reconstruction Problem
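We next determine the maximum cardinality of the intersections $\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi)\cap\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi')$, $\pi\neq\pi'$, pertaining to the reconstruction problem as defined in Section II-B. Let
$$N_{\textsc{m}}(n):=\max_{\substack{\pi,\pi'\in\mathbb{S}_n\\ \pi\neq\pi'}}\big|\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi)\cap\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi')\big|.$$
As it turns out, $N_{\textsc{m}}(n)=S_{\textsc{m}}^{\rightarrow}(n)$ for all $n\geq 2$, and therefore unambiguous reconstruction is in general impossible for MTDRL operations.
Theorem 10.
$N_{\textsc{m}}(n)=2^{n-1}$.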
Proof:
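Consider the sequence $\pi=(1,\ldots,n-2,n,n-1)$ obtained from $\pi^{\textnormal{id}}$ by swapping its last two elements. Recall that every sequence in $\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi^{\textnormal{id}})$ can be obtained from $\pi^{\textnormal{id}}$ via a MTDRL operation whose binary pattern is of the form $\boldsymbol{b}0$, $\boldsymbol{b}\in\{0,1\}^{n-1}$. Furthermore, it can be easily checked that $\pi^{\textnormal{id}}\rightarrow\sigma$ via $\boldsymbol{b}'00$ if and only if $\pi\rightarrow\sigma$ via $\boldsymbol{b}'10$, where $\boldsymbol{b}'\in\{0,1\}^{n-2}$. Likewise, $\pi^{\textnormal{id}}\rightarrow\sigma$ via $\boldsymbol{b}'10$ if and only if $\pi\rightarrow\sigma$ via $\boldsymbol{b}'00$. This shows that every $\sigma$ that belongs to $\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi^{\textnormal{id}})$ also belongs to $\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi)$, and thus $\big|\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi^{\textnormal{id}})\cap\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi)\big|=\big|\boldsymbol{S}_{\textsc{m}}^{\rightarrow}(\pi^{\textnormal{id}})\big|=2^{n-1}$. ∎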
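The coincidence of the two balls used in this proof is easy to observe numerically. The following sketch, with hypothetical function names, builds the MTDRL balls of the identity and of the identity with its last two elements swapped, and also searches all pairs involving the identity for the largest overlap (which, by the relabeling argument of Lemma 1, is assumed to be representative of all pairs).

```python
from itertools import permutations, product

def mtdrl_ball(pi):
    """All sequences obtainable from pi by one mirror-TDRL operation."""
    ball = set()
    for pattern in product((0, 1), repeat=len(pi)):
        first = [s for s, b in zip(pi, pattern) if b]
        second = [s for s, b in zip(pi, pattern) if not b]
        ball.add(tuple(first + second[::-1]))
    return ball

def swap_last_two(pi):
    """Copy of pi with its last two elements exchanged."""
    return tuple(pi[:-2]) + (pi[-1], pi[-2])

def max_mtdrl_intersection(n):
    """Largest overlap between the ball of the identity and any other ball."""
    identity = tuple(range(1, n + 1))
    base = mtdrl_ball(identity)
    return max(len(base & mtdrl_ball(p))
               for p in permutations(identity) if p != identity)

for n in range(2, 7):
    identity = tuple(range(1, n + 1))
    same = mtdrl_ball(identity) == mtdrl_ball(swap_last_two(identity))
    print(n, same, max_mtdrl_intersection(n), 2**(n - 1))
```

Because the two balls are identical as sets, no number of observations can tell the two underlying sequences apart.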
III-C Bounded MTDRL Permutations
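Consider now a more general model where MTDRL rearrangement operations are confined to segments of width $k$ within the original sequence (see Section II-C), and define $S_{\textsc{m}}^{\rightarrow}(n;k)$, $S_{\textsc{m}}^{\leftarrow}(n;k)$, $S_{\textsc{m}}^{\leftrightarrow}(n;k)$ accordingly.
Theorem 11.
$S_{\textsc{m}}^{\rightarrow}(n;k)=(n-k+1)\big(2^{k-1}-1\big)+1$.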
Proof:
We use the inclusion-exclusion counting method for a “sliding window” of width $k$, as for the TDRL model (see the proof of Theorem 6). The main question is how many sequences need to be excluded for a given window in order to avoid double-counting. It turns out that the situation for MTDRL is simpler than for TDRL, and only one sequence needs to be excluded: the identity permutation. Namely, any non-trivial MTDRL operation on the window $(2,\ldots,k+1)$ results in the last symbol ($k+1$) being moved to one of the preceding positions (see (6)), and applying a MTDRL operation to the window $(1,\ldots,k)$ clearly leaves the symbol $k+1$ intact. Therefore, only one sequence, $\pi^{\textnormal{id}}$ itself, can be produced by both a MTDRL operation on the segment $(1,\ldots,k)$ and a MTDRL operation on the segment $(2,\ldots,k+1)$ of $\pi^{\textnormal{id}}$. By using this fact and Theorem 8, we get $S_{\textsc{m}}^{\rightarrow}(n;k)=(n-k+1)\big(2^{k-1}-1\big)+1$. ∎
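If $\mathcal{C}\subseteq\mathbb{S}_n$ is a code that is able to recover from one MTDRL operation of length $k$, then, by Theorem 11 and a simple sphere-packing argument, we obtain the following bound on its cardinality:
$$|\mathcal{C}|\leq\frac{n!}{(n-k+1)\big(2^{k-1}-1\big)+1}.$$
Theorem 12.
$S_{\textsc{m}}^{\leftrightarrow}(n;k)=(n-k+1)(k-1)+1$.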
Proof:
Follows from Theorem 9 after applying the same method of counting as in the proof of Theorem 11. ∎
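Since distinct windows share only the identity, each of the $n-k+1$ windows contributes its non-trivial sequences separately: $2^{k-1}-1$ of them in total and $k-1$ reversible ones. The resulting closed forms, $(n-k+1)(2^{k-1}-1)+1$ and $(n-k+1)(k-1)+1$, are reconstructions from Theorems 8 and 9 and the argument above, and the sketch below (hypothetical function names) checks them by enumeration.

```python
from itertools import product

def bounded_mtdrl_ball(pi, k):
    """Sequences obtainable from pi by one MTDRL operation on a width-k window."""
    pi = tuple(pi)
    ball = set()
    for start in range(len(pi) - k + 1):
        window = pi[start:start + k]
        for pattern in product((0, 1), repeat=k):
            first = [s for s, b in zip(window, pattern) if b]
            second = [s for s, b in zip(window, pattern) if not b]
            # Mirror step: the part taken from the second copy is reversed.
            ball.add(pi[:start] + tuple(first + second[::-1]) + pi[start + k:])
    return ball

def bounded_mtdrl_reversible(n, k):
    """Count sequences the identity produces that produce the identity back."""
    identity = tuple(range(1, n + 1))
    return sum(1 for s in bounded_mtdrl_ball(identity, k)
               if identity in bounded_mtdrl_ball(s, k))

for n in range(2, 7):
    for k in range(2, n + 1):
        size = len(bounded_mtdrl_ball(range(1, n + 1), k))
        print(n, k,
              size, (n - k + 1) * (2**(k - 1) - 1) + 1,
              bounded_mtdrl_reversible(n, k), (n - k + 1) * (k - 1) + 1)
```

For $k=n$ the two closed forms reduce to $2^{n-1}$ and $n$, the unbounded values.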
Acknowledgment
This work was supported by the European Commission (H2020 Antares project, ref. no. 739570).
References
- [1] A. Barg and A. Mazumdar, “Codes in Permutations and Error Correction for Rank Modulation,” IEEE Trans. Inf. Theory, vol. 56, no. 7, pp. 3158–3165, 2010.
- [2] J.-L. Baril and R. Vernay, “Whole Mirror Duplication-Random Loss Model and Pattern Avoiding Permutations,” Inf. Process. Lett., vol. 110, no. 11, pp. 474–480, 2010.
- [3] M. Bernt, K.-Y. Chen, M.-C. Chen, A.-C. Chu, D. Merkle, H.-L. Wang, K.-M. Chao, and M. Middendorf, “Finding All Sorting Tandem Duplication Random Loss Operations,” J. Discrete Algorithms, vol. 9, no. 1, pp. 32–48, 2011.
- [4] M. Bernt and M. Middendorf, “A Method for Computing an Inventory of Metazoan Mitochondrial Gene Order Rearrangements,” BMC Bioinformatics, vol. 12, Suppl. 9, p. S6, 2011.
- [5] I. F. Blake, G. Cohen, and M. Deza, “Coding with Permutations,” Inf. Control, vol. 43, no. 1, pp. 1–19, 1979.
- [6] M. Bóna, Combinatorics of Permutations, Chapman & Hall/CRC Press, 2004.
- [7] M. Bouvel and E. Pergola, “Posets and Permutations in the Duplication–Loss Model: Minimal Permutations with $d$ Descents,” Theor. Comput. Sci., vol. 411, no. 26–28, pp. 2487–2501, 2010.
- [8] M. Bouvel and D. Rossin, “A Variant of the Tandem Duplication – Random Loss Model of Genome Rearrangement,” Theor. Comput. Sci., vol. 410, no. 8–10, pp. 847–858, 2009.
