Reconstructing rearrangement phylogenies of natural genomes

Leonard Bohnenkämper; Jens Stoye; Daniel Doerr

PMC · DOI:10.1186/s13015-025-00279-5·June 7, 2025

Open Access

TL;DR

This paper presents an improved method for reconstructing ancestral genomes using a computational model that handles complex genome rearrangements and chromosomal structures.

Contribution

A highly optimized ILP approach for solving the Small Parsimony Problem under the DCJ-indel model, with improved handling of chromosomal structures.

Findings

01

The optimized ILP method shows significant performance improvements on simulated phylogenies with linear chromosomes.

02

The method outperforms previous approaches even when the true chromosomal structure is circular.

03

Practical benefits are demonstrated in an analysis of seven Anopheles species.

Abstract

We study the classical problem of inferring ancestral genomes from a set of extant genomes under a given phylogeny, known as the Small Parsimony Problem (SPP). Genomes are represented as sequences of oriented markers, organized in one or more linear or circular chromosomes. Any marker may appear in several copies, without restriction on orientation or genomic location, known as the natural genomes model. Evolutionary events along the branches of the phylogeny encompass large scale rearrangements, including segmental inversions, translocations, gain and loss (DCJ-indel model). Even under simpler rearrangement models, such as the classical breakpoint model without duplicates, the SPP is computationally intractable. Nevertheless, the SPP for natural genomes under the DCJ-indel model has been studied recently, with limited success. Building on prior work, we present a highly optimized ILP…

Figures

Gaps reported by Gurobi with increasing solving times for SPP-DCJ and variants of SPP-DCJ-v2 until a time limit of 720 minutes. Right: Zoomed in on the first 25 minutes. NN - no additional pre-processing, IN - initial solution precomputed, IB - initial solution and lower bounds precomputed, IBF - initial solution and lower bounds precomputed, with variable ancestral family sizes

A genome of five markers $1_1$, $1_2$, …

Capping-Free Multi-Relational Diagram for two genomes on an unresolved homology ($\equiv_1$) with families …

CFMRD for the two genomes of Fig. [2](#Fig2) on a resolved homology ($\equiv_2$) with families $\{1_1,1_3\}$, …

Left: A degenerate genome. Right: A linearization of it

Left: This degenerate genome is not linearizable because of missing telomeres. Right: The genome becomes linearizable when adding telomeres. One linearization is that of Fig. [4](#Fig4)

Solving times for SPP-DCJ and SPP-DCJ-v2 on simulated genomes with increasing numbers of telomeres. Solid lines represent corresponding median values

Solving times for SPP-DCJ and SPP-DCJ-v2 on genomes generated by ZOMBI on a range of trees with increasing branch lengths with ancestral adjacencies inferred by DecoSTAR. Solid lines represent corresponding median values

Mean precision, recall and F1 score for default and safer linearization mode for varying tree scales. Transparent ranges indicate minimum to maximum range of the five tested samples per step

Average pre-processing and solving times of 50 samples for variants of SPP-DCJ-v2. NN - no additional pre-processing, IN - initial solution precomputed, IB - initial solution and lower bounds precomputed

Funding

Universitätsklinikum Düsseldorf. Anstalt öffentlichen Rechts (8911)

Keywords

Genome rearrangement · Ancestral reconstruction · Small parsimony · Integer linear programming · Double-cut-and-join

Taxonomy

Topics: Genome Rearrangement Algorithms · Genomics and Phylogenetic Studies · Chromosomal and Genetic Variations

Full text

Introduction

The Small Parsimony Problem (SPP) is a general optimization problem in phylogenetics that aims at annotating the internal vertices of a given phylogenetic tree $T = (V,E)$ whose leaves are already annotated, such that the total tree distance $d_T = \sum_{(A,B) \in E} d(A,B)$ is minimized. Here, $d(A,B)$ is a function returning the distance between the annotations of any two vertices $A$ and $B$ of the phylogenetic tree. Traditionally, tree annotations are DNA or protein sequences; more recently, with the emergence of phylogenomic studies, complete genomes may also be used, often in the form of so-called marker sequences.
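The objective $d_T$ can be made concrete with a minimal sketch. The tree, the string annotations, and the Hamming distance below are illustrative assumptions only; in the paper, $d$ is the DCJ-indel distance and the annotations are genomes.

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

G = TypeVar("G")  # type of a vertex annotation (e.g. a genome)


def total_tree_distance(
    edges: Iterable[Tuple[str, str]],
    annotation: Dict[str, G],
    d: Callable[[G, G], int],
) -> int:
    """Return d_T, the sum of d(A, B) over all tree edges (A, B)."""
    return sum(d(annotation[a], annotation[b]) for a, b in edges)


def hamming(x: str, y: str) -> int:
    """Toy stand-in for d; this is NOT the DCJ-indel distance."""
    return sum(cx != cy for cx, cy in zip(x, y))


# Hypothetical tree with leaves A, B, C and internal vertices u, root.
edges = [("root", "u"), ("root", "C"), ("u", "A"), ("u", "B")]
annotation = {
    "A": "ACGT", "B": "ACGA", "C": "TCGA",  # given leaf annotations
    "u": "ACGA", "root": "ACGA",            # one candidate internal labelling
}

d_T = total_tree_distance(edges, annotation, hamming)
print(d_T)  # 2
```

Solving the SPP means searching over all labellings of `u` and `root` for one that minimizes this sum; the sketch only evaluates a single candidate.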

Distance functions for marker sequences usually consider rearrangements and content-modifying operations on the elements of the sequences. A useful general-purpose distance in genome rearrangement is based on the DCJ-indel model. Conceived by Braga et al. [1] as an extension of the Double-Cut-and-Join model by Yancopoulos et al. [2], the DCJ-indel model comprises two kinds of operations: genomic rearrangements, modeled by a double cut and subsequent joining of the ends thus created (DCJ), and segmental gains and losses of arbitrary length (indels).
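As a rough illustration of a single DCJ operation, the sketch below cuts two adjacencies and rejoins the four freed ends. The string encoding of extremities (`"1h"`, `"2t"`, `"tel1"`) and the helper names `adj` and `dcj` are assumptions made for this example; indels and the special cases involving fewer than two adjacencies are not covered.

```python
from typing import FrozenSet, Set, Tuple

Adjacency = FrozenSet[str]


def adj(x: str, y: str) -> Adjacency:
    """An adjacency is an unordered pair of extremities."""
    return frozenset((x, y))


def dcj(
    adjacencies: Set[Adjacency],
    cut1: Tuple[str, str],
    cut2: Tuple[str, str],
) -> Set[Adjacency]:
    """Cut adjacencies {a, b} and {c, d}; join the ends as {a, c} and {b, d}.

    The other possible rejoining, {a, d} and {b, c}, is obtained by
    passing cut2 in reversed order.
    """
    (a, b), (c, d) = cut1, cut2
    old = {adj(a, b), adj(c, d)}
    if len(old) != 2 or not old <= adjacencies:
        raise ValueError("both cuts must be distinct existing adjacencies")
    return (adjacencies - old) | {adj(a, c), adj(b, d)}


# Hypothetical linear chromosome: tel1 [1> [2> [3> tel2
genome = {adj("tel1", "1t"), adj("1h", "2t"), adj("2h", "3t"), adj("3h", "tel2")}

# Inverting marker 2: cut on both sides of it and rejoin crosswise.
inverted = dcj(genome, ("1h", "2t"), ("2h", "3t"))
```

After the call, marker 2 is flanked by the adjacencies `{1h, 2h}` and `{2t, 3t}`, i.e. it is read in reverse orientation.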

When each marker occurs at most once per genome, the DCJ-indel distance between two genomes can be calculated in polynomial time [1]. However, on genomes with unrestricted distributions of markers, also called natural genomes, calculating the DCJ-indel distance is NP-hard. Nonetheless, efficient ILP solutions exist, such as DING [3].

The first attempt to generalize this method from the pairwise genomic distance to the phylogenomic SPP under the DCJ-indel model was an ILP by Doerr and Chauve [4], called SPP-DCJ. They did so by solving a generalized problem in which, as a result of some pre-processing, adjacencies in ancestral genomes can be absent or present and, in the latter case, can be assigned a weight that is taken into consideration during optimization. One major issue in this generalization was DING’s use of caps, telomeric markers that need to be matched during optimization and for which the solution space is superexponential [5]. Doerr and Chauve went to great lengths to limit the effect of this additional solution space, but were ultimately not able to completely remove it from their solution.

The ILP solution presented in this manuscript combines a recent reformulation of the DCJ-indel model that allows one to forgo the matching of caps [6] with the basic modeling of SPP pioneered by SPP-DCJ. We additionally resolve another issue described in [4], which is the dependence of SPP-DCJ on previously known candidates for circular singletons, for each of which SPP-DCJ creates a number of constraints and variables. Since the number of circular singleton candidates in the worst case is exponential in the number of non-telomeric extremities, the worst-case size of SPP-DCJ is exponential as well. While this problem may be less relevant when given few, refined candidate adjacencies for ancestors, our ILP is the first to solve the SPP for natural genomes under the DCJ-indel model while remaining of polynomial size w.r.t. any input data.

In practice, the SPP, also known as the small phylogeny problem, is central to many methods for ancestral genome reconstruction [7]. For instance, SPP-DCJ [4] is part of the AGO framework [8]. Other methods, such as GASTS [9] and MGRA [10], approach the SPP by iteratively constructing median genomes. The genome median problem asks to construct one ancestral genome for $n \ge 3$ given genomes. It is nevertheless NP-hard, and these and most other methods resort to heuristic or approximate solutions [9–11]. Algorithmic innovations based on ILPs [3, 6, 12] made it possible to compute exact solutions in practical applications. For instance, Frolova et al. [13] employ DING [3] in the calculation of pairwise DCJ-indel distances to study phylogenetic relationships of pathogenic plasmids.

The remainder of the manuscript is organized as follows. In Section "Preliminaries", we give basic definitions and previous results needed to derive our algorithm. In Section "A new method", we explain the fundamental features of our method (Subsections "Capping-free model" and "On linearizability") before presenting the ILP in Subsection "A new ILP formulation" and detailing further methods of pre-processing to tighten the solution space in Subsection "Pre-processing". We evaluate the performance of our method in Section "Evaluation" and discuss our overall findings in Section "Discussion".

Preliminaries

For the purposes of this work, we describe genomes abstractly as sequences of oriented markers. A (genomic) marker $g = (g^\text{t}, g^\text{h})$ is a universally unique entity consisting of two marker extremities: the tail of $g$, denoted by $g^\text{t}$, and the head of $g$, denoted by $g^\text{h}$.

The structure of a genome can be described via its adjacencies. An adjacency $\{f^x,g^y\}$ (with $x,y \in \{\text{t},\text{h}\}$) describes that markers $f$ and $g$ are neighbors on the same chromosome and oriented such that extremities $f^x$ and $g^y$ are adjacent. For ease of notation we also write $f^xg^y$ for an adjacency. Note that adjacencies can be read in either direction, i.e. $g^yf^x$ is the same as $f^xg^y$.

For the sake of a simpler formulation of the theory, we aim for each extremity to be part of some adjacency. In order to accomplish this, we use additional extremities modeling the ends of linear chromosomes, called telomeres. A telomere $t^\circ$ is a universally unique entity encompassing a single telomeric extremity denoted by “$\circ$”. A genome can then be described as a graph as follows.

Definition 1

A genome $\mathbb{A}$ is a graph with vertices $\mathcal{E}(\mathbb{A})\cup \mathcal{T}(\mathbb{A})$, namely its marker extremities $\mathcal{E}(\mathbb{A})$ and telomeric extremities $\mathcal{T}(\mathbb{A})$. The set of edges is $\mathcal{M}(\mathbb{A})\cup \mathcal{A}(\mathbb{A})$, namely its marker edges $\mathcal{M}(\mathbb{A})$ and adjacency edges $\mathcal{A}(\mathbb{A})$. This graph fulfills the following properties:

$\mathcal{M}(\mathbb{A})$ is a perfect matching on $\mathcal{E}(\mathbb{A})$ with $\mathcal{M}(\mathbb{A}) = \{\{m^\text{t},m^\text{h}\} \mid \forall m^\text{t},m^\text{h}\in \mathcal{E}(\mathbb{A})\}$, and
$\mathcal{A}(\mathbb{A})$ is a perfect matching on $\mathcal{E}(\mathbb{A})\cup \mathcal{T}(\mathbb{A})$.

An example of a genome is given in Fig. 1.

Fig. 1. A genome of five markers $1_1$, $1_2$, $2_1$, $3_1$, $4_1$ on a single linear chromosome
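The two matching conditions of Definition 1 can be checked mechanically. The following sketch builds a single linear chromosome and tests both conditions. The marker order and orientations are assumptions (the figure itself is not reproduced here), and the helper names `linear_chromosome`, `is_perfect_matching` and `is_genome` are hypothetical. The check covers only the perfect-matching properties, not that each marker edge joins the tail and head of the same marker.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Edge = FrozenSet[str]


def linear_chromosome(
    markers: List[Tuple[str, bool]], tel_left: str, tel_right: str
) -> Tuple[Set[Edge], Set[Edge]]:
    """Return (marker edges, adjacency edges) of one linear chromosome.

    `markers` lists (name, forward) pairs in chromosome order.
    """
    ends = [tel_left]
    marker_edges: Set[Edge] = set()
    for name, forward in markers:
        tail, head = f"{name}^t", f"{name}^h"
        marker_edges.add(frozenset((tail, head)))
        # A forward marker is read tail-to-head, a reversed one head-to-tail.
        ends.extend((tail, head) if forward else (head, tail))
    ends.append(tel_right)
    # Consecutive pairs (telomere, first end), (second end, third end), ...
    adjacency_edges = {frozenset(ends[i:i + 2]) for i in range(0, len(ends), 2)}
    return marker_edges, adjacency_edges


def is_perfect_matching(edges: Iterable[Edge], vertices: Set[str]) -> bool:
    """Every vertex is covered by exactly one two-element edge."""
    edges = list(edges)
    covered = [v for e in edges for v in e]
    return (
        all(len(e) == 2 for e in edges)
        and len(covered) == len(set(covered))
        and set(covered) == vertices
    )


def is_genome(ext: Set[str], tel: Set[str], m: Set[Edge], a: Set[Edge]) -> bool:
    return is_perfect_matching(m, ext) and is_perfect_matching(a, ext | tel)


# Five markers as in Fig. 1; order and orientation are assumed here.
order = [("1_1", True), ("2_1", True), ("1_2", False), ("3_1", True), ("4_1", True)]
M, A = linear_chromosome(order, "tel_1", "tel_2")
E = {x for edge in M for x in edge}
T = {"tel_1", "tel_2"}
print(is_genome(E, T, M, A))  # True
```

Dropping one telomere from `T` makes the adjacency edges cover a vertex that is not in the graph, so the check fails.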

Because each marker is universally unique, in order to compare genomes we need to establish which markers are homologous. We model homology as an equivalence relation ($\equiv$), that is $m_a\equiv m_b$ for some markers $m_a\in \mathcal{M}(\mathbb{A})$, $m_b\in \mathcal{M}(\mathbb{B})$ and genomes $\mathbb{A},\mathbb{B}$. Note that this includes the case $\mathbb{A}=\mathbb{B}$, i.e. there can be homologous markers in the same genome (in-paralogs). The equivalence class of a marker $m$, denoted by $[m]$, is called its family. If a marker $m$ exists in $\mathbb{A}$, but has no equivalent in $\mathbb{B}$ or vice versa, we refer to $m$ as singular w.r.t. $\mathbb{A},\mathbb{B}$.
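A small sketch of how families and singular markers might be computed. The marker names, the family labels, and the assignment of markers to the two genomes are hypothetical; the homology is encoded as a dictionary `family_of` mapping each marker to a family label.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set


def families(family_of: Dict[str, Hashable]) -> Dict[Hashable, Set[str]]:
    """Group markers into equivalence classes [m] by their family label."""
    classes: Dict[Hashable, Set[str]] = defaultdict(set)
    for marker, label in family_of.items():
        classes[label].add(marker)
    return dict(classes)


def singular_markers(
    family_of: Dict[str, Hashable], markers_a: Set[str], markers_b: Set[str]
) -> Set[str]:
    """Markers of A without an equivalent in B, and vice versa."""
    fam_a = {family_of[m] for m in markers_a}
    fam_b = {family_of[m] for m in markers_b}
    only_a = {m for m in markers_a if family_of[m] not in fam_b}
    only_b = {m for m in markers_b if family_of[m] not in fam_a}
    return only_a | only_b


# Hypothetical input: family 1 has in-paralogs in both genomes,
# families 4 and 5 each occur in one genome only.
family_of = {
    "1_1": 1, "1_2": 1, "1_3": 1, "1_4": 1,
    "2_1": 2, "2_2": 2, "3_1": 3, "3_2": 3, "4_1": 4, "5_1": 5,
}
genome_a = {"1_1", "1_2", "2_1", "3_1", "4_1"}
genome_b = {"1_3", "1_4", "2_2", "3_2", "5_1"}

print(sorted(singular_markers(family_of, genome_a, genome_b)))  # ['4_1', '5_1']
```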

Given the equivalence relation on markers, one can easily derive an equivalence relation on extremities, namely $m_a^\text{t}\equiv m_b^\text{t}$ and $m_a^\text{h}\equiv m_b^\text{h}$ if and only if $m_a\equiv m_b$. For this derived equivalence we have $m_a^\text{h}\not\equiv m_b^\text{t}$ for all $m_a,m_b$. We call extremities singular if and only if their corresponding marker is singular. One can visualize such an equivalence relation for two genomes $\mathbb{A},\mathbb{B}$ using the Capping-Free Multi-Relational Diagram as defined in Definition 2.

Definition 2

Given two genomes $\mathbb{A},\mathbb{B}$ and a homology ($\equiv$), the Capping-Free Multi-Relational Diagram (CFMRD) is a graph $\mathcal{CFMRD}(\mathbb{A},\mathbb{B},\equiv)=(\mathcal{E}\cup \mathcal{T},{E}_\text{adj}\cup {E}_\text{self}\cup {E}_\text{ext})$ with $\mathcal{E}=\mathcal{E}(\mathbb{A})\cup \mathcal{E}(\mathbb{B})$, $\mathcal{T}=\mathcal{T}(\mathbb{A})\cup \mathcal{T}(\mathbb{B})$, adjacency edges ${E}_\text{adj}=\mathcal{A}(\mathbb{A})\cup \mathcal{A}(\mathbb{B})$, self edges ${E}_\text{self}=\{m \in \mathcal{M}(\mathbb{A}) \cup \mathcal{M}(\mathbb{B}) \mid m \text{ singular w.r.t. }\mathbb{A},\mathbb{B}\}$ and extremity edges ${E}_\text{ext}= \{ \{u,v\} \mid u \in \mathcal{E}(\mathbb{A}), v \in \mathcal{E}(\mathbb{B}), u\equiv v\}$.
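The three edge sets of Definition 2 can be assembled as in the following sketch. The encoding is an assumption made for illustration: an extremity is a pair `(marker, "t")` or `(marker, "h")`, a telomere is a pair `(name, "o")`, and `family_of` maps markers to family labels. The function name `build_cfmrd` and the two toy genomes are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet, Hashable, Set, Tuple

Vertex = Tuple[str, str]
Edge = FrozenSet[Vertex]


def extremities(markers: Set[str]) -> Set[Vertex]:
    return {(m, end) for m in markers for end in ("t", "h")}


def build_cfmrd(
    markers_a: Set[str], adj_a: Set[Edge],
    markers_b: Set[str], adj_b: Set[Edge],
    family_of: Dict[str, Hashable],
) -> Dict[str, Set[Edge]]:
    fam_a = {family_of[m] for m in markers_a}
    fam_b = {family_of[m] for m in markers_b}
    singular = {m for m in markers_a if family_of[m] not in fam_b}
    singular |= {m for m in markers_b if family_of[m] not in fam_a}
    # E_adj: all adjacency edges of both genomes.
    e_adj = set(adj_a) | set(adj_b)
    # E_self: the marker edge of every singular marker.
    e_self = {frozenset(((m, "t"), (m, "h"))) for m in singular}
    # E_ext: equivalent extremities across genomes (tail-tail or head-head).
    e_ext = {
        frozenset((u, v))
        for u, v in product(extremities(markers_a), extremities(markers_b))
        if u[1] == v[1] and family_of[u[0]] == family_of[v[0]]
    }
    return {"adj": e_adj, "self": e_self, "ext": e_ext}


def a(x: Vertex, y: Vertex) -> Edge:
    return frozenset((x, y))


# Toy genomes: A = o a1 a2 a3 o and B = o b1 b2 o, all markers forward.
markers_a, markers_b = {"a1", "a2", "a3"}, {"b1", "b2"}
adj_a = {a(("tA1", "o"), ("a1", "t")), a(("a1", "h"), ("a2", "t")),
         a(("a2", "h"), ("a3", "t")), a(("a3", "h"), ("tA2", "o"))}
adj_b = {a(("tB1", "o"), ("b1", "t")), a(("b1", "h"), ("b2", "t")),
         a(("b2", "h"), ("tB2", "o"))}
family_of = {"a1": 1, "a2": 1, "a3": 2, "b1": 1, "b2": 3}

diagram = build_cfmrd(markers_a, adj_a, markers_b, adj_b, family_of)
print({kind: len(edges) for kind, edges in diagram.items()})
```

In this toy input, `a1` and `a2` are in-paralogs that both match `b1`, so each extremity of `b1` has two extremity edges. That is what makes the homology unresolved.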

An example of a CFMRD is given in Fig. 2.

Fig. 2. Capping-Free Multi-Relational Diagram for two genomes on an unresolved homology ($\equiv_1$) with families $\{1_1,1_2,1_3,1_4\}$, $\{2_1,2_2\}$, $\{3_1,3_2\}$, $\{4_1\}$, $\{5_1\}$.

An established way to compare two genomes on a structural level is the rearrangement distance. The rearrangement distance of two genomes $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbb {A},\mathbb {B}$$\end{document}$ is defined as the minimum number of operations needed to transform $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbb {A}$$\end{document}$ into $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbb {B}$$\end{document}$ with operations restricted to a certain model (such as DCJ-indel). 
When ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\equiv$$\end{document}$ ) maps each marker of genome $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbb {A}$$\end{document}$ to at most one marker of genome $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbb {B}$$\end{document}$ , calculating the rearrangement distance between $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbb {A}$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathbb {B}$$\end{document}$ is typically easy. We refer to such a homology as resolved. 
More formally, a homology is resolved if for each genome $\mathbb{A}$ and marker $m\in \mathcal{M}(\mathbb{A})$ the family of m contains only itself, i.e. $[m]\cap \mathcal{M}(\mathbb{A})=\{m\}$. On these homologies, $\mathcal{CFMRD}(\mathbb{A},\mathbb{B},\equiv)$ consists only of simple cycles and simple paths. An example of a CFMRD on a resolved homology is shown in Fig. 3.

Fig. 3. CFMRD for the two genomes of Fig. 2 on a resolved homology ($\equiv_2$) with families $\{1_1,1_3\}$, $\{1_2,1_4\}$, $\{2_1,2_2\}$, $\{3_1,3_2\}$, $\{4_1\}$, $\{5_1\}$. Note that ($\equiv_2$) is a matching on ($\equiv_1$)

With a resolved homology, the DCJ-indel distance can be calculated easily by counting the different types of components in the CFMRD. For the purpose of this counting, we ignore self edges. We write c for the number of cycles and $p_{ab}$ (resp. $p_{aa}$, resp. $p_{bb}$) for the number of paths that start in $\mathbb{A}$ and end in $\mathbb{B}$ (resp. start in $\mathbb{A}$ and end in $\mathbb{A}$, resp. start in $\mathbb{B}$ and end in $\mathbb{B}$). Since the graph is undirected, we canonicalize their labels by reading paths from $\mathbb{A}$ to $\mathbb{B}$. When the vertex the path starts or ends in is a telomere of $\mathbb{A}$ (resp. $\mathbb{B}$), we write A (resp. B) in uppercase. When the path ends because the only way to continue it would be a self edge (note that we ignore self edges here), we write a (resp. b) in lowercase. When a path starts and ends in the same genome, we read it from telomere to singular extremity (note that in all other cases, the label is symmetric).

For example, the CFMRD of Fig. 3 has $c=2$, $p_{AB}=1$ (path $t^\circ,1_1^\text{t},1_3^\text{t},t^\circ$), $p_{ab}=1$ (path $4_1^\text{t},2_1^\text{h},2_2^\text{h},5_1^\text{t}$), $p_{aB}=1$ (path $4_1^\text{h},3_1^\text{h},3_2^\text{h},t^\circ$) and $p_{Ab}=1$ (path $t^\circ,3_1^\text{t},3_2^\text{t},5_1^\text{h}$).
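The labelling convention can be illustrated with a small sketch. It assumes that each path end is described by a hypothetical pair `(genome, is_telomere)`; this representation and the function name `path_label` are illustrative and not taken from the paper. The example encodes the four paths listed for Fig. 3 by their end types only.

```python
from collections import Counter


def path_label(end1, end2):
    """Return the canonical label of a CFMRD path from its two end vertices.

    Each end is a pair (genome, is_telomere) with genome "A" or "B".
    A telomere end is written in uppercase. An end at which the path stops
    because only a self edge could continue it is written in lowercase.
    """
    def letter(end):
        genome, is_telomere = end
        return genome if is_telomere else genome.lower()

    (g1, t1), (g2, t2) = end1, end2
    if g1 != g2:
        # Paths between the two genomes are read from A to B.
        first, second = (end1, end2) if g1 == "A" else (end2, end1)
    else:
        # Paths within one genome are read from telomere to singular extremity.
        first, second = (end1, end2) if t1 >= t2 else (end2, end1)
    return letter(first) + letter(second)


# End types of the four paths named in the example for Fig. 3.
fig3_paths = [
    (("A", True), ("B", True)),    # t°, 1_1^t, 1_3^t, t°
    (("A", False), ("B", False)),  # 4_1^t, 2_1^h, 2_2^h, 5_1^t
    (("A", False), ("B", True)),   # 4_1^h, 3_1^h, 3_2^h, t°
    (("A", True), ("B", False)),   # t°, 3_1^t, 3_2^t, 5_1^h
]
counts = Counter(path_label(u, v) for u, v in fig3_paths)
```

Under this encoding, `counts` contains one path each for the labels AB, ab, aB and Ab, matching the values given in the example.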

There is one case in which we need to consider self edges, namely circular singletons. Circular singletons are cycles that consist only of adjacency and self edges. We denote their number by s. For a more in-depth explanation of these terms, the interested reader is referred to [6]. With these terms, the distance can be expressed by the following formula.

Theorem 1

(adapted from [6]) For two genomes $\mathbb{A},\mathbb{B}$ and a resolved homology (${\mathop{\equiv}\limits^{\star}}$), the DCJ-indel distance is

$$\begin{aligned} \bar{d}_{\mathrm{DCJ-ID}}(\mathbb{A}, \mathbb{B}, {\mathop{\equiv}\limits^{\star}}) \, = \, n - c + \left\lceil \frac{p_{ab} + \max(p_{Aa},p_{aB}) + \max(p_{Ab},p_{Bb}) - p_{AB}}{2}\right\rceil + s \end{aligned}$$

with n the number of matched markers, $n = |\{(m_a,m_b)\in \mathcal{M}(\mathbb{A})\times \mathcal{M}(\mathbb{B}) \mid m_a{\mathop{\equiv}\limits^{\star}}m_b\}|$.

This formula holds because it is equivalent to previously proven distance formulas under the DCJ-indel model; however, it can also be derived independently. Details are explained in [6]. To paraphrase the results there, it is shown that two genomes are equal if and only if their CFMRD consists of only c cycles and $p_{AB}$ paths between telomeres of both genomes with $n=c+\frac{p_{AB}}{2}$. Additionally, for each DCJ or indel operation the formula of Theorem 1 changes by at most 1. These two facts combined yield the formula as a lower bound. Finally, [6] contains an algorithm transforming $\mathbb{A}$ into $\mathbb{B}$ using DCJ and indel operations that reaches this lower bound, proving that the formula gives the rearrangement distance under the DCJ-indel model.
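As a minimal sketch, the formula of Theorem 1 can be evaluated directly from the component counts. The function name and the dictionary-based interface are assumptions made for illustration. The example uses the counts stated for Fig. 3, takes n = 4 from the four two-element families of the resolved homology, and assumes s = 0, which the text does not state explicitly.

```python
def dcj_indel_distance(n, c, s, p):
    """Evaluate the formula of Theorem 1 from component counts.

    n: number of matched markers, c: number of cycles,
    s: number of circular singletons,
    p: dict mapping path labels (e.g. "ab", "Aa", "aB") to counts;
       labels that are missing count as zero.
    """
    def count(label):
        return p.get(label, 0)

    numerator = (
        count("ab")
        + max(count("Aa"), count("aB"))
        + max(count("Ab"), count("Bb"))
        - count("AB")
    )
    # Integer ceiling of numerator / 2; also correct for negative numerators.
    half_ceiling = -(-numerator // 2)
    return n - c + half_ceiling + s


# Counts from the Fig. 3 example; s = 0 is an assumption.
fig3_counts = {"AB": 1, "ab": 1, "aB": 1, "Ab": 1}
fig3_value = dcj_indel_distance(n=4, c=2, s=0, p=fig3_counts)

# Sanity check against the equality criterion n = c + p_AB / 2.
equal_value = dcj_indel_distance(n=3, c=2, s=0, p={"AB": 2})
```

With these inputs the formula evaluates to 3 for the Fig. 3 counts and to 0 for the configuration satisfying $n=c+\frac{p_{AB}}{2}$.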

When the homology is not resolved, we need to refine it into a resolved homology. We call such a refinement a matching. More formally, a matching (${\mathop{\equiv}\limits^{\star}}$) on ($\equiv$) is a resolved homology such that $m_a{\mathop{\equiv}\limits^{\star}}m_b\implies m_a\equiv m_b$.

Since allowing for arbitrary matchings can lead to an excess of indels in the sorting scenario, we restrict ourselves to the maximum matching model. A matching $({\mathop{\equiv}\limits^{+}})$ is maximum w.r.t. $\mathbb{A},\mathbb{B}$ if a maximum number of markers in $\mathbb{A}$ has a homolog in $\mathbb{B}$ and vice versa.

Definition 3

Given homology ($\equiv$), the DCJ-indel distance between $\mathbb{A}$ and $\mathbb{B}$ under the maximum matching model is

$$\begin{aligned} d_{\mathrm{DCJ-ID}}(\mathbb{A},\mathbb{B},\equiv) \, = \, \min_{({\mathop{\equiv}\limits^{+}}) \mathrm{\ maximum\ matching\ on\ } (\equiv)} \bar{d}_{\mathrm{DCJ-ID}}(\mathbb{A},\mathbb{B},{\mathop{\equiv}\limits^{+}}). \end{aligned}$$
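To make the minimization in Definition 3 concrete, the following brute-force sketch enumerates all maximum matchings of a small unresolved homology. It assumes a hypothetical input format in which each family maps to its markers in $\mathbb{A}$ and in $\mathbb{B}$, and it takes the distance on a resolved homology as a caller-supplied function `resolved_distance`, since constructing the CFMRD is outside the scope of this sketch. The enumeration grows factorially with family size and is meant for illustration only, not as the method used in the paper.

```python
from itertools import permutations, product


def family_matchings(markers_a, markers_b):
    """All ways to pair min(|a|, |b|) markers of one family across genomes."""
    if len(markers_a) <= len(markers_b):
        return [list(zip(markers_a, perm))
                for perm in permutations(markers_b, len(markers_a))]
    return [list(zip(perm, markers_b))
            for perm in permutations(markers_a, len(markers_b))]


def maximum_matchings(families):
    """Yield every maximum matching as a list of (marker_a, marker_b) pairs."""
    per_family = [family_matchings(a, b) for a, b in families.values()]
    for choice in product(*per_family):
        yield [pair for family in choice for pair in family]


def min_distance(families, resolved_distance):
    """Minimum of resolved_distance over all maximum matchings."""
    return min(resolved_distance(m) for m in maximum_matchings(families))


# Hypothetical toy homology: family "1" has two copies in each genome,
# family "2" has one copy in each, family "4" occurs only in genome A.
toy_families = {
    "1": (["1_1", "1_2"], ["1_3", "1_4"]),
    "2": (["2_1"], ["2_2"]),
    "4": (["4_1"], []),
}
all_matchings = list(maximum_matchings(toy_families))
```

For the toy input, `all_matchings` contains two matchings of three pairs each, differing only in how the two copies of family "1" are paired.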

When reconstructing a phylogeny, only extant genomes are known, that is, there is no definitive information about the adjacencies at the inner nodes. In order to capture this uncertainty, a typical approach is to generate a large set of candidate adjacencies at each inner node that very likely will include the correct ones. Such a set can be viewed as a degenerate genome, which however may contain multiple conflicting adjacencies, such as ab and ac with $b\ne c$. (In a normal genome this cannot occur, as the matching requirement ensures that there is only one adjacency that involves a.) More formally, a degenerate genome $\mathbb{D}$ is a graph $(\mathcal{E}(\mathbb{D})\cup \mathcal{T}(\mathbb{D}), \mathcal{M}(\mathbb{D})\cup \mathcal{A}(\mathbb{D}))$ that fulfills only Property 1 of Definition 1.

All possible ancestors at a certain node in the phylogeny are then built from disambiguations of these conflicting adjacencies. We call these possible ancestors linearizations. A linearization of a degenerate genome $\mathbb{D}$ is a genome $\mathbb{A}$, such that $\mathcal{E}(\mathbb{A})=\mathcal{E}(\mathbb{D})$, $\mathcal{T}(\mathbb{A}) \subseteq \mathcal{T}(\mathbb{D})$, $\mathcal{M}(\mathbb{A})=\mathcal{M}(\mathbb{D})$ and $\mathcal{A}(\mathbb{A})\subseteq \mathcal{A}(\mathbb{D})$. If such a linearization exists, we call $\mathbb{D}$ linearizable. We give an example of a linearizable degenerate genome and one of its linearizations in Fig. 4. Note that each genome is also a degenerate genome with precisely one linearization, namely itself.

Fig. 4. Left: A degenerate genome. Right: A linearization of it

We can then formulate the problem considered in this paper as finding linearizations of all (degenerate) genomes in the phylogeny such that the sum of all DCJ-indel distances in the tree is minimized. Optionally, we also allow weights on the adjacencies and take these into account during the minimization.

Problem 1

(Weighted Small Parsimony Linearization Problem) Given a phylogeny $T=(V,E)$, a homology ($\equiv$), a weighting function w for adjacencies, and a parameter $\alpha \in [0,1]$, find a linearization $\mathbb{L}_i$ for each (degenerate) genome $\mathbb{D}_i$ in T, such that

$$\begin{aligned} \sum_{(\mathbb{D}_i,\mathbb{D}_k)\in E} \left( \alpha \, d_{\mathrm{DCJ-ID}}(\mathbb{L}_i,\mathbb{L}_k,\equiv) \, + \, (\alpha -1) \sum_{ab\in \mathcal{A}(\mathbb{L}_i)\cup \mathcal{A}(\mathbb{L}_k)} w(ab)\right) \end{aligned}$$

is minimized.
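The objective of Problem 1 can be evaluated for a fixed choice of linearizations as in the following sketch. The data layout is an assumption made for illustration: adjacencies are arbitrary hashable identifiers, `adjacencies` maps each tree node to the adjacency set of its chosen linearization, and `distance` is a caller-supplied function standing in for the DCJ-indel distance between two linearizations.

```python
def spp_objective(edges, adjacencies, distance, w, alpha):
    """Evaluate the objective of Problem 1 for fixed linearizations.

    edges: iterable of tree edges (i, k),
    adjacencies: dict node -> set of adjacencies of its linearization,
    distance: function (i, k) -> DCJ-indel distance (supplied by the caller),
    w: dict adjacency -> weight, alpha: parameter in [0, 1].
    """
    total = 0.0
    for i, k in edges:
        weight_sum = sum(w[ab] for ab in adjacencies[i] | adjacencies[k])
        # (alpha - 1) is non-positive, so heavier adjacencies lower the score.
        total += alpha * distance(i, k) + (alpha - 1) * weight_sum
    return total


# Hypothetical toy instance with one ancestor and two leaves.
toy_edges = [("root", "X"), ("root", "Y")]
toy_adjacencies = {
    "root": {("root", "a_h", "b_t")},
    "X": {("X", "a_h", "b_t")},
    "Y": {("Y", "a_h", "c_t")},
}
toy_w = {
    ("root", "a_h", "b_t"): 1.0,
    ("X", "a_h", "b_t"): 2.0,
    ("Y", "a_h", "c_t"): 1.0,
}
toy_dist = {("root", "X"): 0, ("root", "Y"): 1}
toy_value = spp_objective(
    toy_edges, toy_adjacencies, lambda i, k: toy_dist[(i, k)], toy_w, alpha=0.5
)
```

Note that, following the formula as written, the adjacencies of an inner node contribute once for every tree edge incident to that node. In the toy instance the two edges contribute -1.5 and -0.5, so `toy_value` is -2.0.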

Because the pairwise comparison of (non-degenerate) natural genomes is already NP-hard, the Weighted Small Parsimony Linearization Problem is NP-hard as well. Doerr and Chauve’s algorithm SPP-DCJ, which solves Problem 1, is therefore formulated as an ILP. We likewise formulate our improved algorithm in Section 3.3 as an ILP.

A new method

Capping-free model

The previous solution by Doerr and Chauve [4] was based on a different graph structure, namely the Capped Multi-Relational Diagram (CMRD). The CMRD differs from the CFMRD in the way it treats telomeres. In the CMRD of two genomes $\mathbb{A}$ and $\mathbb{B}$ there exist additional extremity edges between each telomere of $\mathbb{A}$ and each telomere of $\mathbb{B}$, leading to additional $|\mathcal{T}(\mathbb{A})| \cdot |\mathcal{T}(\mathbb{B})|$ extremity edges.

When calculating the DCJ-indel distance using the CMRD, one must not only determine the resolved homology, but also a matching between telomeres, that is, on $\mathcal{T}(\mathbb{A}) \times \mathcal{T}(\mathbb{B})$. As identified in [5], this leads to a superexponential increase of the solution space. As our new method is based on the CFMRD, we can use the formula of Theorem 1 and thus avoid such an increase in the solution space.

On linearizability

It is vital for our method that the degenerate genomes in the phylogeny are linearizable (see Problem 1). However, not all degenerate genomes are linearizable (see Fig. 5). Moreover, not all methods used to infer candidate adjacencies for ancestors guarantee this requirement. In particular, DeCoSTAR [14], a method for inferring ancestral genomes that is integrated together with SPP-DCJ into a larger reconstruction workflow detailed in [8], generates conflicting ancestral adjacencies.

As far as we are aware, no polynomial-time algorithm testing for linearizability has been described so far. However, we give an algorithm here that generates a linearization if one exists, thereby also solving the testing problem.

Recall that $\mathcal{T}(\mathbb{D})$ are the telomeres and $\mathcal{E}(\mathbb{D})$ are the extremities of a degenerate genome $\mathbb{D}$. We are interested in finding a matching M on the adjacencies $\mathcal{A}(\mathbb{D})$ of $\mathbb{D}$, such that each extremity is part of exactly one edge in M. This is equivalent to the linearization problem as any telomeres not part of the matching can then be removed and one obtains a genome.

To see how we are able to determine such a matching, consider the weight function w that assigns to each adjacency edge $\{u,v\}\in \mathcal{A}(\mathbb{D})$ the number of extremities incident to it: $w(\{u,v\}) = |\{u,v\} \cap \mathcal{E}(\mathbb{D})|$.

Lemma 1

$\mathbb{D}$ is linearizable if and only if a maximum weight matching M on the weighted graph $\big(\mathcal{T}(\mathbb{D})\cup \mathcal{E}(\mathbb{D}),\mathcal{A}(\mathbb{D}),w\big)$ has total weight $|\mathcal{E}(\mathbb{D})|$.

Proof

Note that there are no edges $\{u,v\}$ with both $u,v\in \mathcal{T}(\mathbb{D})$.

Assume a matching $M_S$ that covers the subset $S\subseteq \mathcal{E}(\mathbb{D})$. We further subdivide S into the disjoint sets $S_1$ and $S_2$. $S_1$ contains all vertices $v\in S$ that are matched with a telomere, that is $(v,u)\in M_S$ with $u\in \mathcal{T}(\mathbb{D})$. $S_2$ contains the vertices that are matched with another extremity (note that $v\in S_2$ and $(v,u)\in M_S$ imply $u\in S_2$). Since there are no edges between telomeres directly, the total weight of $M_S$ is

$$\begin{aligned} \sum_{\{u,v\}\in M_S} w(\{u,v\}) &= \sum_{\{u,v\}\in M_S,\ u \text{ or } v \in S_1} w(\{u,v\}) + \sum_{\{u,v\}\in M_S,\ u,v\in S_2} w(\{u,v\})\\ &= |S_1| + 2 \frac{|S_2|}{2} = |S| \end{aligned}$$

We thus see that a matching has weight k if and only if it covers a subset of $\mathcal{E}(\mathbb{D})$ of size k. The claim of the lemma follows by noting that a matching can have weight at most $|\mathcal{E}(\mathbb{D})|$ and that, if such a matching $M_{\mathcal{E}}$ exists, we can use $M_{\mathcal{E}}$ as the adjacencies of the linearization of $\mathbb{D}$. $\square$

Using Lemma 1, we can either find that there is no linearization or determine one using a standard maximum weight matching algorithm for any degenerate genome $\mathbb{D}$.

Fig. 5. Left: This degenerate genome is not linearizable because of missing telomeres. Right: The genome becomes linearizable when adding telomeres. One linearization is that of Fig. 4.
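The test implied by Lemma 1 can be sketched in a few lines of Python. The edge weights (2 for an adjacency between two marker extremities, 1 for a telomeric adjacency) are read off the proof above. The function names, vertex labels and toy genome are hypothetical. For brevity, the sketch replaces the standard maximum weight matching algorithm with an exhaustive search, so it is only suitable for very small inputs.

```python
def max_weight_matching(extremities, adjacencies):
    """Exhaustively search for a maximum weight matching over adjacency edges.

    extremities: set of marker extremities; every other vertex is a telomere.
    adjacencies: list of 2-tuples (u, v) of candidate adjacencies.
    Returns (weight, matching). Exponential time, for illustration only.
    """
    def weight(u, v):
        # An edge weighs as much as the number of marker extremities it covers.
        return (u in extremities) + (v in extremities)

    best = (0, [])

    def search(i, used, total, chosen):
        nonlocal best
        if total > best[0]:
            best = (total, list(chosen))
        if i == len(adjacencies):
            return
        u, v = adjacencies[i]
        if u != v and u not in used and v not in used:
            chosen.append((u, v))            # take edge i
            search(i + 1, used | {u, v}, total + weight(u, v), chosen)
            chosen.pop()
        search(i + 1, used, total, chosen)   # skip edge i

    search(0, frozenset(), 0, [])
    return best


def linearization(extremities, adjacencies):
    """Return the adjacencies of one linearization, or None if none exists."""
    total, matching = max_weight_matching(extremities, adjacencies)
    return matching if total == len(extremities) else None


# Hypothetical toy genome: two markers with tail/head extremities.
ext = {"1t", "1h", "2t", "2h"}
adj = [("1h", "2t")]
print(linearization(ext, adj))  # None: 1t and 2h cannot be covered
print(linearization(ext, adj + [("1t", "T1"), ("2h", "T2")]))
```

On realistic inputs, a polynomial-time routine such as NetworkX's `max_weight_matching` would take the place of the exhaustive search; the acceptance criterion (total weight equal to the number of marker extremities) stays the same.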

While we can test whether genomes are linearizable using this maximum weight matching algorithm, previous versions of SPP-DCJ modified the given degenerate genomes by adding telomeres, such that they are guaranteed to be linearizable, which may still be desirable on noisy data (see Subsection 3.2.2). We detail these methods briefly in the following subsections.

Local guarantees

The first method of guaranteeing linearizability relies on the following lemma.

Lemma 2

A perfect matching $M \subseteq \mathcal{A}(\mathbb{D})$ in a degenerate genome $\mathbb{D}= (\mathcal{E}(\mathbb{D})\cup \mathcal{T}(\mathbb{D}), \mathcal{M}(\mathbb{D})\cup \mathcal{A}(\mathbb{D}))$ corresponds to a linearization of $\mathbb{D}$.

Proof

Observe that in the M-induced degenerate genome $\mathbb{D}' = (\mathcal{E}(\mathbb{D}) \cup \mathcal{T}(\mathbb{D}), \mathcal{M}(\mathbb{D}) \cup M)$ each node is incident to exactly one adjacency edge. Further, each connected component corresponds to a linear component where both degree-one nodes correspond to telomeres, or a circular component where each node corresponds to a marker extremity. $\square$

However, the converse is not true: since not all telomeric extremities must be covered, $\mathbb{D}$ may still be linearizable even if no perfect matching can be derived from $\mathbb{D}$.

In an earlier version of SPP-DCJ [4], a simple approach was introduced that complements each degenerate genome $\mathbb{D}$ with additional telomeres and telomeric adjacencies to ensure linearizability. To this end, $\mathbb{D}$ is decomposed into connected components that are tested independently. If the size of a component, i.e., the number of its vertices, is even, and it is either linear, circular, or fully connected, then it is considered locally linearizable. Otherwise, each extremity v of the component is complemented with a telomere $t_v$, and a telomeric adjacency $\{v,t_v\}$ is added to the degenerate genome, ensuring that it is linearizable as a whole.
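The component test can be illustrated with a small Python sketch. It assumes that components are taken with respect to adjacency edges only and that the adjacency graph is simple (no loops or parallel edges); both are our reading of the description, and the implementation in [4] may differ. The function name and the telomere labels `t_<v>` are hypothetical.

```python
from collections import defaultdict


def complement_with_telomeres(extremities, adjacencies):
    """Add a telomere t_v and an adjacency (v, t_v) for every extremity v of
    each component that is not recognised as locally linearizable."""
    neighbours = defaultdict(set)
    for u, v in adjacencies:
        neighbours[u].add(v)
        neighbours[v].add(u)
    vertices = set(extremities) | set(neighbours)

    added, seen = [], set()
    for start in sorted(vertices):
        if start in seen:
            continue
        # Collect the connected component that contains `start`.
        component, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x in component:
                continue
            component.add(x)
            stack.extend(neighbours[x] - component)
        seen |= component

        n = len(component)
        degrees = [len(neighbours[x]) for x in component]
        m = sum(degrees) // 2                        # number of edges
        is_linear = m == n - 1 and max(degrees) <= 2  # a path
        is_circular = m == n and all(d == 2 for d in degrees)
        is_complete = m == n * (n - 1) // 2           # fully connected
        if n % 2 == 0 and (is_linear or is_circular or is_complete):
            continue  # locally linearizable, left untouched
        for v in sorted(component & set(extremities)):
            added.append((v, "t_" + v))
    return list(adjacencies) + added


# Hypothetical example: {a, b} is an even path, {c, d, e} an odd path.
result = complement_with_telomeres(
    {"a", "b", "c", "d", "e"}, [("a", "b"), ("c", "d"), ("d", "e")]
)
print(result)
```

Each of the three accepted shapes has a perfect matching when its size is even, which is why Lemma 2 applies to such components without further changes.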

Allowing each extremity to be connected to a telomere

Given the uncertainty about inferred ancestral adjacencies, even when a component is locally linearizable, individual adjacencies of that component might still be wrongly inferred by the pre-processing and thus might be erroneously included in the linearization, simply because otherwise a linearization might not be possible.

In order to prevent this behavior, we offer a mode in which each extremity is connected to an (artificially introduced) telomere to reflect this uncertainty. In contrast to the method described above, we do this even in components with local guarantees.

This approach was previously impractical because of inefficient handling of telomeres. Now it may become the standard mode of operation, as it allows finding reasonable solutions for noisy input data, while the computational overhead introduced by the artificial telomeres remains moderate. We refer to this mode as the safer linearization mode in subsequent sections.

A new ILP formulation

Algorithm 1 gives an overview of our method with additional tables detailing parts of the ILP.

In principle, our algorithm solves Problem 1 in the same way as SPP-DCJ [4], namely it determines linearizations while simultaneously computing the distances between nodes in the phylogeny with the objective of minimizing the total distance. However, for ease of readability, we separate the linearization and distance computation into two different subsections.

On the global level, the linearizations $\mathbb{L}_i$ are derived for each (degenerate) genome $\mathbb{D}_i$. On the local level, the resulting linearizations are compared to each other along the branches of the phylogeny. Each branch gives rise to a pairwise comparison by means of the CFMRD. In doing so, the selection of adjacencies of a derived genome is propagated across CFMRDs, thus ensuring global consistency.

The main differences between our algorithm and that in [4] are found on the local level, as this is where the CFMRD plays a role.

Global level

The global level deals with the setting of adjacencies or telomeres of (ancestral) genomes. For each (marker or telomeric) extremity v, we determine its presence or absence with a binary variable $\texttt{g}_v$. For markers, the head extremity is present if and only if the tail extremity is (see Constraint C.01). Since there is often uncertainty about the precise copy number of markers in ancestral genomes, we allow user-determined bounds $(L_F^{\mathbb{A}},H_F^{\mathbb{A}})$ for the number of markers in each family F in ancestral genome $\mathbb{A}$ (C.02). If not specified, these bounds default to the maximum, requiring each marker to occur, that is, they collapse to

$$\begin{aligned} (\texttt{C.01A}) \quad \texttt{g}_v = 1 \quad v\in \{m^t,m^h\} \text{ for } m \in \mathcal{M} \text{ with } L_{[m]}^{\mathbb{A}}= H_{[m]}^{\mathbb{A}}=|[m]\cap \mathcal{M}|. \end{aligned}$$

Each extremity present is then required to be part of exactly one (possibly telomeric) adjacency (C.03), which ensures a properly linearized genome.
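To make the three global-level conditions concrete, the following Python sketch checks a candidate assignment against the prose descriptions of C.01 to C.03. It is a checker, not the ILP itself. The data layout and all names are hypothetical, and it assumes that an absent vertex takes part in no selected adjacency, which the text does not state explicitly.

```python
def check_global_level(markers, family_of, bounds, g, x):
    """Check a candidate assignment against the prose of C.01 to C.03.

    markers:   iterable of marker names m; extremities are (m, "t"), (m, "h")
    family_of: dict marker -> family
    bounds:    dict family -> (low, high) allowed copy numbers
    g:         dict vertex -> 0/1 presence (marker extremities and telomeres)
    x:         dict frozenset({u, v}) -> 0/1 selected adjacencies
    Returns the list of violated constraints (empty if there are none).
    """
    violations = []
    counts = {}
    for m in markers:
        tail, head = g[(m, "t")], g[(m, "h")]
        if tail != head:                      # C.01: head present iff tail present
            violations.append(("C.01", m))
        counts[family_of[m]] = counts.get(family_of[m], 0) + tail
    for fam, (low, high) in bounds.items():   # C.02: copy-number bounds per family
        if not low <= counts.get(fam, 0) <= high:
            violations.append(("C.02", fam))
    for v, present in g.items():              # C.03: one adjacency per present vertex
        degree = sum(val for edge, val in x.items() if v in edge)
        if degree != present:                 # assumption: absent vertices have none
            violations.append(("C.03", v))
    return violations


# Hypothetical linear chromosome T1 - marker 1 - marker 2 - T2.
markers = ["1", "2"]
family_of = {"1": "A", "2": "B"}
bounds = {"A": (1, 1), "B": (1, 1)}
g = {("1", "t"): 1, ("1", "h"): 1, ("2", "t"): 1, ("2", "h"): 1, "T1": 1, "T2": 1}
x = {
    frozenset({("1", "h"), ("2", "t")}): 1,
    frozenset({("1", "t"), "T1"}): 1,
    frozenset({("2", "h"), "T2"}): 1,
}
print(check_global_level(markers, family_of, bounds, g, x))  # []
```

With the default bounds, every family must reach its maximum copy number, which corresponds to the special case C.01A above.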

Local level

The local level deals with each edge of the tree separately, making use of the CFMRD of the corresponding genome pair. Since this part is entirely local to the edge in question, we presume that each vertex $v_i$ of the CFMRD has a unique identifier among all other CFMRDs, making all its variables globally unique. In order to limit the range of the general variable $y_{v_i}$, we also assign each vertex a rank i that is local and unique only within the specific CFMRD. We map each extremity to its identifier for the global level by the function $\gamma$.

In order to compute decompositions of CFMRDs, we make use of a capping-free formulation for the computation of the pairwise DCJ-indel distance derived in [6]. This formulation is based on the distance formula found in Theorem 1.

The formulation counts cycles $\texttt{c}_E$ as well as the six different types of paths relevant to Theorem 1, namely $\texttt{p}^{ab},\texttt{p}^{Aa},\texttt{p}^{aB},\texttt{p}^{Ab},\texttt{p}^{Bb},\texttt{p}^{AB}$. Each counting variable $\texttt{p}^X$ is set by summing up binary report variables $\texttt{r}^X_v$ that are set to 1 once per component on a specific vertex v (see Constraints C.09 to C.13 and C.18). These counters are then combined into the terms of the formula in Constraints C.14 to C.17 and C.04 to C.08. The constraints that ensure the report variables are set correctly can be found in Tables 1, 2 and 3. For a complete description of this part of the ILP, the interested reader is referred to [6].

Table 1. Shao-Lin-Moret constraints [12]
(C.27) $\texttt{x}_e=\texttt{x}_d$ for all sibling edges e, d
(C.28) $\texttt{y}_{v_i} + j(1-\texttt{x}_{u_jv_i}) \ge \texttt{y}_{u_j} \quad \forall u_jv_i \in E_{\text{adj}}\cup E_{\text{ext}}$
$j(1-\texttt{x}_{u_jv_i})\ge \texttt{y}_{u_j} \quad \forall u_jv_i \in E_{\text{self}}$
(C.29) $i\,\texttt{z}_{v_i} \le \texttt{y}_{v_i} \quad \forall v\in \mathcal{E}\cup \mathcal{T}$

Table 2. Reporting for regular vertices
(C.30) $\texttt{l}_v \le 1 - \texttt{x}_{uv} \quad \forall uv \in E_{\text{self}},u\in \mathcal{E}(\mathbb{A})$
$\texttt{l}_v \ge \texttt{x}_{uv} \quad \forall uv \in E_{\text{self}},u\in \mathcal{E}(\mathbb{B})$
(C.31) $\texttt{l}_v \le \texttt{l}_u + (1-\texttt{x}_{uv}) \quad \forall uv \in E_{\text{ext}}$
$\texttt{l}_u \le \texttt{l}_v + \texttt{r}_{uv}^{ab} + (1-\texttt{x}_{uv}) \quad \forall uv\in E_{\text{adj}},u\in \mathcal{E}(\mathbb{A})$
$\texttt{l}_u \le \texttt{l}_v + (1-\texttt{x}_{uv}) \quad \forall uv\in E_{\text{adj}},u\in \mathcal{E}(\mathbb{B})$
(C.32) $\texttt{r}_{v}^c \le \texttt{z}_v \quad \forall v \in \mathcal{E}(\mathbb{A})$
(C.33) $\texttt{r}_{u}^{ab} \le \texttt{x}_{uv} \quad \forall uv\in E_{\text{self}},u\in \mathcal{E}(\mathbb{A})$

Table 3. Reporting for telomeres
(C.34) $\texttt{l}_v = 0 \quad \forall v \in \mathcal{T}(\mathbb{A})$
$\texttt{l}_v = 1 \quad \forall v \in \mathcal{T}(\mathbb{B})$
(C.35) $\texttt{l}_u \le \texttt{l}_v + \texttt{r}_{v}^{AB} + \texttt{r}_{v}^{Ab} + (1-\texttt{x}_{uv}) \quad \forall uv\in E_{\text{adj}}, v \in \mathcal{T}(\mathbb{A})$
$\texttt{l}_u \le \texttt{l}_v + \texttt{r}_{u}^{aB} + (1-\texttt{x}_{uv}) \quad \forall uv\in E_{\text{adj}}, u\in \mathcal{T}(\mathbb{B})$
(C.36) $\texttt{r}_{v}^{AB} \le \texttt{z}_v \quad \forall v \in \mathcal{T}(\mathbb{A})$
(C.37) $1 -\texttt{y}_v \le \texttt{r}^{Ab}_{v} + \texttt{r}^{Aa}_v \quad v \in \mathcal{T}(\mathbb{A})$
$1 -\texttt{y}_v \le \texttt{r}^{aB}_{v} + \texttt{r}^{Bb}_v \quad v \in \mathcal{T}(\mathbb{B})$
(C.38) $\texttt{y}_{v_i} \le i(1-\texttt{r}_{v}^{R}) \quad v\in \mathcal{T}(\mathbb{A}),R\in \{Ab, Aa\}$
$\texttt{y}_{v_i} \le i(1-\texttt{r}_{v}^{R}) \quad v\in \mathcal{T}(\mathbb{B}),R\in \{aB, Bb\}$
(C.39) $\texttt{r}_{v}^{AB} \le \texttt{l}_u + (1-\texttt{x}_{uv}) \quad \forall uv\in E_{\text{adj}}, v \in \mathcal{T}(\mathbb{A})$
$\texttt{r}_{v}^{Ab} \le \texttt{l}_u + (1-\texttt{x}_{uv}) \quad \forall uv\in E_{\text{adj}}, v \in \mathcal{T}(\mathbb{A})$
$\texttt{r}_{v}^{aB} \le 1-\texttt{l}_u + (1-\texttt{x}_{uv}) \quad \forall uv\in E_{\text{adj}}, v \in \mathcal{T}(\mathbb{B})$
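The labelling mechanism behind C.28 and C.29 can be mimicked outside an ILP solver. The Python sketch below computes, for a fixed edge selection, the largest labels y that satisfy C.28 and then sets z wherever C.29 permits. It rests on three assumptions that are not spelled out in the tables: ranks start at 1, each label is bounded from above by its rank ($\texttt{y}_{v_i} \le i$), and C.28 is applied in both orientations of every edge. All function and variable names are hypothetical.

```python
def maximal_labels(rank, edges, self_edges, x):
    """Largest labels y permitted by C.28, and the z values permitted by C.29.

    rank:       dict vertex -> rank i (assumed to start at 1)
    edges:      list of (u, v) adjacency or extremity edges
    self_edges: list of (u, v) self edges
    x:          dict edge -> 0/1, the fixed edge selection
    """
    y = dict(rank)                      # assumed upper bound: y_{v_i} <= i
    for u, v in self_edges:             # C.28 on self edges: selection forces y to 0
        if x[(u, v)]:
            y[u] = y[v] = 0
    changed = True
    while changed:                      # C.28 on selected adjacency/extremity edges
        changed = False
        for u, v in edges:
            if x[(u, v)] and y[u] != y[v]:
                y[u] = y[v] = min(y[u], y[v])
                changed = True
    # C.29: z_{v_i} = 1 requires y_{v_i} >= i.
    z = {v: int(y[v] >= rank[v]) for v in rank}
    return y, z


# Hypothetical example: a 4-cycle a-b-c-d and a path e-f whose vertex f
# carries a selected self edge (f, g).
rank = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("e", "f")]
self_edges = [("f", "g")]
x = {e: 1 for e in edges + self_edges}
y, z = maximal_labels(rank, edges, self_edges, x)
print(sum(z.values()))
```

Under these assumptions, at most one vertex per component can receive $\texttt{z}_v=1$, namely the one with the lowest rank, and no vertex can in a component that contains a selected self edge.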

We make only a few major changes in our local section w.r.t. the ILP described in [6]. Firstly, we determine whether an adjacency edge e is set ($\texttt{x}_e=1$) by “inheriting” this value from the corresponding adjacency of the linearization generated in the global section (see C.21). Secondly, we allow only vertices that are part of the linearized genome ($\texttt{g}_v=1$) to contribute to the count of components that decrease the formula ($\texttt{z}_v=1$), see C.22. To enforce the Maximum matching model, for any family we allow self edges only in one of the two genomes (C.23). If it is clear from the bounds in which genome the family will be overrepresented, the self edges in the underrepresented genome can be removed and the constraint can be dropped.

Because ancestral genomes may be degenerate, the number of possible circular singletons can be as large as the number of possible circular chromosomes. Listing all candidates, as is done in [6] and in SPP-DCJ [4], leads to a combinatorial explosion on certain input data. In particular, when all possible adjacencies are present in the degenerate genome, any non-empty subset of singular markers can form a circular singleton. In this case, a lower bound on the number of candidates is therefore $\sum_{i=1}^{|E_{\text{self}}|} \binom{|E_{\text{self}}|}{i} = 2^{|E_{\text{self}}|}-1$. To avoid an exponential worst-case size of our ILP, we use a new technique for counting circular singletons without listing all candidates whenever the number of candidates exceeds a given (polynomial) threshold, which we arbitrarily set at twice the number of self edges. The constraints for this technique are listed in Table 4 and described in the following.
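The growth of the candidate count relative to the threshold can be illustrated with a few lines of Python. This is only a sketch of the worst case described above (every non-empty subset of self edges is a candidate); the function names are hypothetical and not part of SPP-DCJ-v2.

```python
from math import comb


def singleton_candidates(n_self: int) -> int:
    """Worst-case number of circular singleton candidates:
    all non-empty subsets of n_self self edges."""
    return sum(comb(n_self, i) for i in range(1, n_self + 1))


def exceeds_threshold(n_self: int) -> bool:
    """True if the worst-case count exceeds twice the number of self edges."""
    return singleton_candidates(n_self) > 2 * n_self


for n in (1, 2, 3, 10, 30):
    # The binomial sum agrees with the closed form 2^n - 1.
    assert singleton_candidates(n) == 2**n - 1
    print(n, singleton_candidates(n), 2 * n, exceeds_threshold(n))
```

Under this worst-case assumption, the threshold is already exceeded with three self edges, so explicit listing only pays off for very small or very constrained instances.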

Algorithm 1. Capping-free Small Parsimony

Table 4. Reporting circular singletons

(C.24) $\texttt{d}_u+\texttt{d}_v + \texttt{x}_{uv}\le 2$, $\forall uv \in E_{\text{adj}}\cup E_{\text{self}}$

(C.24) $\texttt{d}_u + \texttt{d}_v - \texttt{x}_{uv} \ge 0$, $\forall uv \in E_{\text{adj}}\cup E_{\text{self}}$

(C.25) $\texttt{w}_u = \texttt{w}_v$, $\forall uv \in E_{\text{self}}$

(C.26) $K (1- \texttt{x}_{uv} + \texttt{r}^{s}_{u} + \texttt{r}^{s}_{v}) + \texttt{w}_v \ge \texttt{w}_u + \texttt{d}_v - \texttt{d}_u$, $\forall uv\in E_{\text{adj}}$

A circular singleton manifests in the graph as a cycle of alternating adjacency and indel edges. The idea of the technique is to have a general integer variable $\texttt{w}$ that is required to increase at each adjacency edge in a walk of the cycle. There must then be one point in the walk at which it decreases again. By detecting this point, one can report a circular singleton. For this to work, the walk needs a direction. This is accomplished by annotating the vertices with a binary variable $\texttt{d}_v$ that “flips” across each pair of connected vertices (see C.24). We then require $\texttt{w}$ to be the same for vertices connected by an indel edge (see C.25) and to increase by 1 in the direction of the vertex that has $\texttt{d}_v=1$ (see C.26).
We require this except when vertices are not connected ($1-\texttt{x}_{uv}=1$) or when reporting a circular singleton ($\texttt{r}^s_u=1$ or $\texttt{r}^s_v=1$). In these cases, the constraint is automatically fulfilled because the maximum length $K$ of circular singletons is added to the left-hand side of the inequality.

Table 5. Domains - global level

(D.01) $\texttt{g}_v\in \{0,1\}$ for each genome $\mathbb{X}$, $\forall v \in \mathcal{E}(\mathbb{X})\cup \mathcal{T}(\mathbb{X})$

(D.02) $\texttt{f}_E,\texttt{n}_E,\texttt{c}_E,\texttt{s}_E\in \mathbb{N}_0$, $\forall E\in E(T)$

(D.03) $\texttt{p}^{xy}_E,p^{\max a}_E,p^{\max b}_E\in \mathbb{N}_0$, $\forall E\in E(T)\, \forall x,y\in \{A,B,a,b\},x\ne y$

(D.04) $q_E\in \mathbb{Z}$, $\forall E\in E(T)$

(D.05) $\texttt{w}_E\in \mathbb{R}$, $\forall E\in E(T)$

Table 6. Domains - local level. For each edge $(\mathbb{A},\mathbb{B}) \in E(T)$ with $\mathcal{CFMRD}(\mathbb{A},\mathbb{B}) = (\mathcal{E}\cup \mathcal{T}, E_{\text{all}})$ with $E_{\text{all}} = E_{\text{adj}}\cup E_{\text{ext}}\cup E_{\text{self}}$:

(D.06) $\texttt{x}_{e}\in \{0,1\}$, $\forall e\in E_{\text{all}}$

(D.07) $\texttt{y}_{v_i}\in \{0,...,i\}$, $v_i\in \mathcal{E}\cup \mathcal{T}$

(D.08) $\texttt{z}_{v},l_v\in \{0,1\}$, $v\in \mathcal{E}\cup \mathcal{T}$

(D.09) $\texttt{d}_v\in \{0,1\}$, $v\in \mathcal{E}$

(D.10) $\texttt{w}_v\in \mathbb{N}_0$, $v\in \mathcal{E}$

(D.11) $r^{ab}_v\in \{0,1\}$, $\forall v\in \mathcal{E}(\mathbb{A})$

(D.12) $r^{Aa}_v,r^{Ab}_v,r^{AB}_v\in \{0,1\}$, $\forall v\in \mathcal{T}(\mathbb{A})$

(D.13) $r^{aB}_v,r^{Bb}_v\in \{0,1\}$, $\forall v \in \mathcal{T}(\mathbb{B})$

(D.14) $b_f\in \{0,1\}$ for each family $f$
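The circular singleton constraints C.24 to C.26 of Table 4 can be sanity-checked by brute force on a toy graph. The following sketch is not the SPP-DCJ-v2 implementation: it treats every listed edge as set ($\texttt{x}_{uv}=1$), applies C.26 to both orientations of each adjacency edge, and caps $\texttt{w}$ at $K$ to keep the enumeration finite. All three choices, as well as the function name `min_reports`, are assumptions made for illustration.

```python
from itertools import product


def min_reports(n, adj_edges, self_edges, K):
    """Fewest vertices with r^s_v = 1 that make C.24-C.26 feasible.

    All listed edges are treated as set (x_uv = 1).
    Returns None if no assignment is feasible.
    """
    best = None
    for d in product((0, 1), repeat=n):
        # C.24 with x_uv = 1 reduces to d_u + d_v = 1: d flips along each edge.
        if any(d[u] + d[v] != 1 for u, v in adj_edges + self_edges):
            continue
        for w in product(range(K + 1), repeat=n):
            # C.25: w is equal across indel (self) edges.
            if any(w[u] != w[v] for u, v in self_edges):
                continue
            for r in product((0, 1), repeat=n):
                # C.26 with x_uv = 1, checked in both orientations (assumption).
                ok = all(
                    K * (r[a] + r[b]) + w[b] >= w[a] + d[b] - d[a]
                    for u, v in adj_edges
                    for a, b in ((u, v), (v, u))
                )
                if ok and (best is None or sum(r) < best):
                    best = sum(r)
    return best


# Toy circular singleton: 0 -adj- 1 -indel- 2 -adj- 3 -indel- 0
cycle = min_reports(4, [(0, 1), (2, 3)], [(1, 2), (3, 0)], K=4)
# Same vertices without the closing indel edge: a path, not a cycle.
path = min_reports(4, [(0, 1), (2, 3)], [(1, 2)], K=4)
print(cycle, path)
```

On the cycle, $\texttt{w}$ would have to exceed itself after one full walk, so at least one report variable must be set. On the path, no report is needed.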

Size of the ILP

For each CFMRD, the local level of the ILP assigns a constant number of variables to each vertex and edge (see Table 6). Additionally, there is a constant number of constraints associated with each vertex and edge (see Tables 1, 2, 3, 4). For each edge in the phylogeny, there is a constant number of constraints and variables per edge or vertex in the global level (see C.01 to C.03 and Table 5, respectively). The size of the ILP is thus linear in the total size of all CFMRDs of the tree.

Pre-processing

We provide two pre-processing options aimed at reducing the solution space. First, we offer the option to compute an initial solution from which the solver starts. This guarantees that an approximate solution will be found and also provides an immediate upper bound on the objective. The algorithm to compute such a solution proceeds in two steps, corresponding to the global and local level of the ILP, respectively. In the first step, the algorithm determines linearizations for all ancestral genomes using the algorithm described in Section 3.2, taking into account the weights of the adjacencies. In the second step, decompositions for each CFMRD are determined by greedily fixing cycles in the graph in order of ascending length.

The second option for pre-processing allows us to bound the solution from below by using knowledge not available to the solver. To see how this method works, consider two genomes $\mathbb{A},\mathbb{B}$ and a degenerate genome $\mathbb{D}$. Transforming $\mathbb{A}$ into a linearization of $\mathbb{D}$ and this linearization into $\mathbb{B}$ must use at least as many DCJ and indel operations as transforming $\mathbb{A}$ into $\mathbb{B}$ via any intermediate genome $\mathbb{C}$ with the same copy numbers of families as $\mathbb{D}$. This idea can be generalized to multiple intermediate genomes. Thus, by precomputing the distance $d(\mathbb{A},\mathbb{B})$ between leaves using ding [6] while taking into account the number of occurrences per family, we can derive the following additional global constraint:

$$\begin{aligned} (\texttt{C.opt}) \quad \sum_{\substack{E \in E(T)\\ E \text{ on path between } \mathbb{A},\mathbb{B}}} \texttt{f}_E \ge d(\mathbb{A},\mathbb{B}) \quad \text{for all pairs of leaves } \mathbb{A},\mathbb{B}. \end{aligned}$$
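To make the index set of (C.opt) concrete, the sketch below collects the tree edges on the path between two leaves and checks the constraint for given branch values. The child-to-parent map, the leaf distances and the values of $\texttt{f}_E$ are made-up illustration data, and the function names are hypothetical. In practice the distances would come from ding.

```python
def path_edges(parent, a, b):
    """Edges (child, parent) on the unique tree path between nodes a and b."""
    def chain(v):
        nodes = [v]
        while nodes[-1] in parent:
            nodes.append(parent[nodes[-1]])
        return nodes

    up_a, up_b = chain(a), chain(b)
    in_b = set(up_b)
    # The first ancestor of a that is also an ancestor of b is their LCA.
    lca = next(v for v in up_a if v in in_b)
    edges = []
    for nodes in (up_a, up_b):
        for v in nodes:
            if v == lca:
                break
            edges.append((v, parent[v]))
    return edges


def violated_bounds(parent, leaf_dist, f):
    """Leaf pairs whose (C.opt) constraint is violated by branch values f."""
    return [
        (a, b)
        for (a, b), d_ab in leaf_dist.items()
        if sum(f[e] for e in path_edges(parent, a, b)) < d_ab
    ]


# Hypothetical balanced tree with leaves A, B, C, D.
parent = {"A": "F", "B": "F", "C": "G", "D": "G", "F": "Root", "G": "Root"}
leaf_dist = {("A", "B"): 5, ("A", "C"): 12, ("C", "D"): 7}  # made-up distances
f = {(child, par): 3 for child, par in parent.items()}      # made-up f_E values

print(path_edges(parent, "A", "C"))
print(violated_bounds(parent, leaf_dist, f))
```

In an ILP, each leaf pair would contribute one linear constraint over the variables of the returned edges rather than a check on fixed values.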

Evaluation

We implemented Algorithm 1 and made it publicly available (see footnote 1). We refer to this algorithm as SPP-DCJ-v2 in the following. We performed several experiments evaluating the solving time under different conditions compared to SPP-DCJ, as well as precision and recall for the safer linearization mode.

While solving the same problem, SPP-DCJ adds another parameter $\beta$ to the optimization, which gives further negative weight to telomeres. In short, the objective function of SPP-DCJ is equivalent to the form

Minimize

$$\begin{aligned} \alpha '\sum_{E\in E(T)} \texttt{f}_E \, + \, \beta ' \sum_{E\in E(T)} \# \text{telomeres in decompositions of } E \, - \, (1-\alpha ' - \beta ') \sum_{E\in E(T)} \texttt{w}_E \end{aligned}$$

We can simulate this behavior in our ILP by decreasing the assigned weight of telomeric adjacencies and by using a re-scaled $\alpha$.

When comparing to SPP-DCJ, we thus used default settings for SPP-DCJ with $\alpha '=\frac{1}{2}$, $\beta '=\frac{1}{4}$. This corresponds in our ILP to $\alpha =\frac{2}{3}$ and reducing the weight of each telomeric adjacency by 1, so we used these parameters for SPP-DCJ-v2 when comparing to SPP-DCJ.
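The re-scaling can be reproduced with exact fractions. The sketch assumes that the objective of our ILP has the form $\alpha \sum_E \texttt{f}_E - (1-\alpha) \sum_E \texttt{w}_E$, which is not restated in this section, and that the adjacency-weight coefficient $1-\alpha'-\beta'$ is positive. The function name `rescale` is hypothetical.

```python
from fractions import Fraction


def rescale(alpha_p, beta_p):
    """Map SPP-DCJ weights (alpha', beta') to (alpha, weight reduction per telomere).

    Assumes the target objective alpha * sum(f) - (1 - alpha) * sum(w)
    and a positive adjacency-weight coefficient 1 - alpha' - beta'.
    """
    gamma_p = 1 - alpha_p - beta_p   # coefficient of the adjacency weights
    scale = alpha_p + gamma_p        # normaliser so the two weights sum to 1
    alpha = alpha_p / scale
    penalty = beta_p / gamma_p       # weight subtracted per telomeric adjacency
    return alpha, penalty


alpha, penalty = rescale(Fraction(1, 2), Fraction(1, 4))
print(alpha, penalty)

# Both objectives agree up to the constant factor alpha' + gamma' on sample values.
f_sum, telomeres, w_sum = 17, 4, 9   # made-up totals
spp_dcj = Fraction(1, 2) * f_sum + Fraction(1, 4) * telomeres - Fraction(1, 4) * w_sum
ours = alpha * f_sum - (1 - alpha) * (w_sum - penalty * telomeres)
assert spp_dcj == Fraction(3, 4) * ours
```

Since a positive constant factor does not change the minimizer, both objectives select the same solutions under these assumptions.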

We used gurobi version 12.0.0 on a single thread and with a time limit of 1 hour (3600 seconds) to solve the ILPs unless specified otherwise.

Performance on linear genomes

To compare the behavior of SPP-DCJ and SPP-DCJ-v2 in the presence of multiple linear chromosomes, we used the simulator ffs-dcj introduced in [6]. The simulator performs a number of DCJs, indels and duplications with fixed rates for a given tree topology. In this experiment, we used a fixed balanced tree topology, namely (((A : 1.0, B : 1.0)F : 1.0), ((C : 1.0, D : 1.0)G : 1.0))Root;. We simulated 30 operations per branch on genomes of 100 markers. More detailed settings (such as rates of duplications and indels) can be found in Table 7. The experiment was run for 2, 4, 6, 8, 10, 12, 14 and 16 linear chromosomes at the root of the tree with 10 replicates for each step. We then introduced 30 adjacencies of adversarial noise for each sample at the inner nodes using a script provided by the SPP-DCJ repository.

Table 7. Parameters for ffs-dcj for the linear chromosome experiment

Duplication rate: 0.4

Zipf parameter duplication: 6.0

Deletion rate: 0.2

Insertion rate: 0.1

Zipf parameter indel: 4.0

Fig. 6. Solving times for SPP-DCJ and SPP-DCJ-v2 on simulated genomes with increasing numbers of telomeres. Solid lines represent corresponding median values

We then ran SPP-DCJ and SPP-DCJ-v2 on degenerate genomes consisting of the true and noise adjacencies. To ensure a fair comparison, we neither performed the performance-optimizing pre-processing steps from Section "Pre-processing" nor gave any ranges for marker multiplicities in ancestral genomes for SPP-DCJ-v2. The resulting solving times are shown in Fig. 6.

On average, SPP-DCJ-v2 needed more than two orders of magnitude less solving time than SPP-DCJ. Even when comparing the best run of SPP-DCJ to the worst run of SPP-DCJ-v2 per step, the difference is still about one order of magnitude.

A majority of SPP-DCJ runs did not complete within the time limit. The performance of SPP-DCJ also dramatically worsens with increasing numbers of linear chromosomes, such that no ILPs were solved within the time limit for 16 chromosomes.

SPP-DCJ-v2 in turn was also affected by the rising numbers of linear chromosomes, but the effect is less drastic. In fact, the solving time for SPP-DCJ-v2 is well below a minute for all samples.

Performance on circular genomes

As we have seen in Section "On linearizability", even when in the ground truth all linearizations of chromosomes are circular, additional telomeres might still be necessary to ensure that all degenerate genomes are linearizable.

To examine this effect, we used the same pipeline as in [4] to simulate trees and genomes of 100 markers for each tree using ZOMBI [15] with tree scales ranging from 5 to 20 with 50 samples per step (for all parameter settings see Table 8). We then inferred degenerate genomes using DeCoSTAR [14] and solved the resulting SPP instances using SPP-DCJ and SPP-DCJ-v2, the latter again without additional pre-processing. We visualize the resulting solving times in Fig. 7.

Table 8. Parameter settings for ZOMBI and DeCoSTAR for the tree scale and precision experiments. For the sake of benchmarking SPP-DCJ-v2, ZOMBI parameters for genome evolution were chosen to represent an elevated degree of genome evolution, both in terms of gene content innovation (duplication+loss) and rearrangement (inversion+transposition)

ZOMBI

DUPLICATION: f:2

INITIAL_GENOME_SIZE: 100

LOSS: f:2

LOSS_EXTENSION: g:0.8

ORIGINATION: f:0

INVERSION: f:2

INVERSION_EXTENSION: g:0.5

TRANSPOSITION: f:2

TRANSPOSITION_EXTENSION: g:0.5

DeCoSTAR

use.boltzmann: 1

boltzmann.temperature: 1.0

nb.sample: 1000

Fig. 7. Solving times for SPP-DCJ and SPP-DCJ-v2 on genomes generated by ZOMBI on a range of trees with increasing branch lengths with ancestral adjacencies inferred by DeCoSTAR. Solid lines represent corresponding median values

Genomes generated by ZOMBI are circular, so one might assume that there is only a negligible difference in runtime between SPP-DCJ and SPP-DCJ-v2. However, the results indicate that the improved handling of the solution space allows SPP-DCJ-v2 to solve problem instances with up to twice the tree scale of those solved by SPP-DCJ in comparable solving times.

Unexpectedly, the median solving times of SPP-DCJ-v2 decrease after a maximum at tree scale 15. We conjecture that this might be because enough rearrangements accumulate to make the genomes behave as if not related, making the problem easier to solve. SPP-DCJ however seems not to benefit from this effect in the tested tree scale and time range as the median solving time reaches the time limit for a tree scale of 11 and does not recover.

Evaluation of the safer linearization mode

We used the same pipeline to simulate genomes of 1000 markers with ZOMBI, inferring degenerate ancestral genomes with DeCoSTAR over a range of tree scales with five samples per step. All other parameters are the same as in Table 8. This time, however, we used SPP-DCJ-v2 with both the default and the safer linearization modes and examined the precision and recall of recovered adjacencies. In this experiment, we used $\alpha =0.5$ with weight 0 for the telomeric adjacencies added to ensure linearizability (see Section "On linearizability").

Fig. 8. Mean precision, recall and F1 score for default and safer linearization mode for varying tree scales. Transparent ranges indicate minimum to maximum range of the five tested samples per step

The results, illustrated in Fig. 8, indicate that while our method displays very high precision and recall rates in both modes, the safer linearization mode has a minor but consistent advantage over the default setting, especially in precision. The trend in the data suggests that this gap could widen further on noisier data.
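For reference, precision, recall and F1 score over sets of adjacencies can be computed as in the following sketch. Representing an adjacency as an unordered pair of marker extremities and the labels such as `"1h"` and `"2t"` are assumptions for illustration. The evaluation scripts used for Fig. 8 may differ in detail.

```python
def adjacency(x, y):
    """An adjacency as an unordered pair of marker extremities."""
    return frozenset((x, y))


def precision_recall_f1(predicted, truth):
    """Precision, recall and F1 of predicted adjacencies against the ground truth."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # correctly recovered adjacencies
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: two of three predicted adjacencies are correct.
truth = {adjacency("1h", "2t"), adjacency("2h", "3t"), adjacency("3h", "4t")}
predicted = {adjacency("2t", "1h"), adjacency("2h", "3t"), adjacency("3h", "5t")}
p, r, f1 = precision_recall_f1(predicted, truth)
print(p, r, f1)
```

Using unordered pairs means that an adjacency counts as recovered regardless of the order in which its two extremities are listed.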

Evaluating the effect of initial solution and lower bounds

We ran the ZOMBI pipeline again with the parameters detailed in Table 8 for tree scales 5, 10, 15, and 20, generating 50 samples each. This time, we ran only SPP-DCJ-v2 and examined the effect of providing an initial solution or lower bounds for the ILP. The average solving and pre-processing times are given in Fig. 9.

Fig. 9. Average pre-processing and solving times of 50 samples for variants of SPP-DCJ-v2. NN - no additional pre-processing, IN - initial solution precomputed, IB - initial solution and lower bounds precomputed

While there is a slight trend towards decreasing solving times with additional pre-processing, the time needed for the pre-processing itself dominates much of the runtime, especially on lower tree scales. Even on high tree scales, the benefit for the solving time seems to be outweighed by the time needed to complete the pre-processing. We therefore do not recommend computing an initial solution unless no approximate solution would be found otherwise. In the same vein, we do not recommend precomputing lower bounds unless they were already used to construct the phylogeny. The precomputation of lower bounds could possibly be improved by precomputing distances only for certain pairs of leaves rather than all of them, which would decrease the runtime while possibly keeping the positive effects on solving time.

Reconstructing the ancestral X chromosomes of seven mosquitos

We further evaluated our method on biological data from seven Anopheles species whose inferred phylogeny is depicted in Fig. 10. Gene annotations of protein-coding genes on the X chromosome of present-day mosquitos were obtained from VectorBase [16]. Chromosome sizes varied around 600 genes. We then used the ancestral gene order (AGO) pipeline [8] to obtain candidate ancestral adjacencies. Using AGO, multiple sequence alignments were computed with MACSE [17], based upon which gene trees were inferred and reconciled with the species tree with IQ-TREE [18]. Finally, candidate ancestral adjacencies were computed with DeCoSTAR.

Fig. 10. Cladogram for seven Anopheles taxa

We ran SPP-DCJ and SPP-DCJ-v2 with varying optimization levels to generate corresponding ILPs and initial solutions where applicable. Additionally, we generated an ILP based on SPP-DCJ-v2, for which we allowed copy numbers in ancestral families to deviate by one from the maximum number of copies. The resulting ILPs were then input to gurobi 12.0.0, which ran on 10 threads with a time limit of 12 hours on the same machine for all ILPs.

We visualize the gaps reported by gurobi over time in Fig. 11. For all variants of SPP-DCJ-v2, gurobi found solutions with significantly smaller gaps than for SPP-DCJ. In fact, gaps as small as the final gap for SPP-DCJ were reached by all versions of SPP-DCJ-v2 within the first 25 minutes of solving time.
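For readers unfamiliar with the quantity being plotted, the following sketch computes the relative optimality gap as Gurobi defines it, namely |bound − incumbent| / |incumbent|, where the incumbent is the objective value of the best solution found so far and the bound is the best proven bound on the optimum. The function name and the example numbers are illustrative assumptions and do not correspond to values from the experiments.

```python
import math


def relative_gap(incumbent, bound):
    """Relative optimality gap following Gurobi's MIP gap definition."""
    if incumbent == 0:
        # Zero gap only if bound and incumbent coincide at zero;
        # otherwise the relative gap is unbounded.
        return 0.0 if bound == 0 else math.inf
    return abs(bound - incumbent) / abs(incumbent)


# Illustrative numbers for a minimization problem:
# best solution costs 120, proven lower bound is 90.
example_gap = relative_gap(incumbent=120, bound=90)
```

A gap of 0 certifies that the incumbent is optimal; for a minimization problem such as the SPP, smaller gaps mean the best solution found is provably closer to the optimum.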

As before, the pre-processing optimizations have only a minor effect on the quality of the result at the end of solving time. Indeed, the effect is most strongly visible within the first few minutes of solving time, after which gurobi’s own heuristic solutions start to overshadow the initial solutions found in pre-processing.

Interestingly, allowing for uncertainty about the multiplicity of the families in ancestral genomes did not slow computation, but instead sped it up considerably. This suggests that the added flexibility makes it possible to find solutions that better fit the given phylogeny and adjacencies.

Fig. 11. Gaps reported by gurobi with increasing solving times for SPP-DCJ and variants of SPP-DCJ-v2 until a time limit of 720 minutes. Right: Zoomed in on the first 25 minutes. NN - no additional pre-processing, IN - initial solution precomputed, IB - initial solution and lower bounds precomputed, IBF - initial solution and lower bounds precomputed, with variable ancestral family sizes

Discussion

We presented SPP-DCJ-v2, the first ILP of polynomial size to solve the Small Parsimony Problem for natural genomes under the DCJ-indel model. Using a more efficient representation of the solution space, the Capping-Free Multi-Relational Diagram, we were able to significantly improve upon the performance of its predecessor, SPP-DCJ. Additionally, we introduced a new method of ensuring linearizability that is more robust when applied to (potentially noisy) real data because linearization is no longer the main constraint. We also introduced a feature that allows users to specify their own bounds on marker multiplicities in ancestral genomes, which may help in ambiguous cases on real data. We evaluated our method on simulated data and found it to be more efficient than its predecessor. Additional bounds on the solution space do not seem to help performance, especially when considering the additional time needed for pre-processing, but could potentially be helpful on very large problem instances. Finally, we demonstrated that our approach is efficient enough to derive good solutions for the SPP on real phylogenies within reasonable time frames.