# Global exact optimisations for chloroplast structural haplotype scaffolding

**Authors:** Victor Epain, Rumen Andonov

PMC · DOI: 10.1186/s13015-023-00243-1 · 2024-02-06

## TL;DR

This paper introduces a new method for assembling chloroplast genomes by using mathematical optimization to scaffold genomic regions, particularly focusing on repeats and multiple genome forms.

## Contribution

A novel discrete optimization formulation for chloroplast scaffolding is introduced, its decision version is proven NP-Complete, and it is solved exactly with an integer linear program implemented in the Python 3 package khloraascaf.

## Key findings

- A new formulation of the scaffolding problem for chloroplast genomes was developed, and its decision version was proven NP-Complete.
- The approach successfully models genomic regions and repeats to scaffold multiple genome forms within a single chloroplast cell.
- The method was implemented in a Python package and tested on synthetic data to evaluate performance and robustness.

## Abstract

Scaffolding is an intermediate stage of fragment assembly. It consists of orienting and ordering the contigs obtained by assembling the sequencing reads. In the general case, the problem has been widely studied using distance data between the contigs. Here we focus on scaffolding dedicated to chloroplast genomes. As these genomes are small and circular, with few specific repeats, numerous approaches have been proposed to assemble them. However, their specificities have not been sufficiently exploited.
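To make "orienting and ordering the contigs" concrete, here is a minimal sketch of how a scaffold can be written as an ordered list of oriented contigs and turned back into a sequence. The contig names, the toy sequences and the helper `scaffold_sequence` are hypothetical and only serve the illustration; they are not taken from the paper or from its package.

```python
from typing import Dict, List, Tuple

# Translation table for the four DNA bases (toy alphabet, no ambiguity codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence read on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def scaffold_sequence(contigs: Dict[str, str],
                      order: List[Tuple[str, str]]) -> str:
    """Concatenate contigs following an ordered list of (name, orientation).

    Orientation '+' keeps the contig as assembled, '-' uses its
    reverse complement. Gaps and overlaps are ignored in this sketch.
    """
    parts = []
    for name, orientation in order:
        seq = contigs[name]
        parts.append(seq if orientation == "+" else reverse_complement(seq))
    return "".join(parts)


# Hypothetical toy contigs and one possible scaffold.
contigs = {"c1": "AAC", "c2": "GGT"}
order = [("c2", "-"), ("c1", "+")]
print(scaffold_sequence(contigs, order))  # prints ACCAAC
```

A scaffolder has to choose both the order of the list and the sign attached to each contig; the sequence itself follows mechanically from those two choices.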

We give a new formulation for the scaffolding in the case of chloroplast genomes as a discrete optimisation problem, whose decision version we prove to be $\mathcal{NP}$-Complete. We take advantage of the knowledge of chloroplast genomes and succeed in expressing the relationships between a few specific genomic repeats as mathematical constraints. Our approach is independent of the distances and adopts a genomic regions view, with priority given to scaffolding the repeats first. In this way, we encode the structural haplotype issue in order to retrieve several genome forms that coexist in the same chloroplast cell. To solve the optimisation problem exactly, we develop an integer linear program that we implement in the Python 3 package khloraascaf. We test it on synthetic data to investigate its performance behaviour and its robustness against several chosen difficulties.

We succeed in modelling biological knowledge of genomic structures to scaffold chloroplast genomes. Our results suggest that modelling genomic regions is sufficient for scaffolding repeats and is suitable for finding several solutions corresponding to several genome forms.
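The idea that one set of contigs and links can admit several circular scaffolds, one per genome form, can be sketched on a toy example. The code below is not the paper's integer linear program: it swaps in a plain exhaustive search, which is only practical on tiny inputs, and it does not use the khloraascaf API. The region names (`LSC`, `IR`, `SSC`), the multiplicities, the links and the function `circular_scaffolds` are all assumptions made for the illustration.

```python
from typing import Dict, List, Set, Tuple

Oriented = Tuple[str, str]        # (region name, '+' or '-')
Link = Tuple[Oriented, Oriented]  # "u is followed by v"


def flip(oc: Oriented) -> Oriented:
    """Return the same region read on the opposite strand."""
    name, sign = oc
    return (name, "-" if sign == "+" else "+")


def symmetric_links(links: Set[Link]) -> Set[Link]:
    """Add, for each link (u, v), its reverse-strand twin (flip(v), flip(u))."""
    full = set(links)
    for u, v in links:
        full.add((flip(v), flip(u)))
    return full


def circular_scaffolds(start: Oriented,
                       multiplicity: Dict[str, int],
                       links: Set[Link]) -> List[List[Oriented]]:
    """Enumerate circular walks from `start` using each region exactly
    `multiplicity[name]` times (exhaustive search, toy sizes only)."""
    all_links = symmetric_links(links)
    total = sum(multiplicity.values())
    found: List[List[Oriented]] = []
    used = {name: 0 for name in multiplicity}
    used[start[0]] = 1

    def extend(path: List[Oriented]) -> None:
        if len(path) == total:
            # Keep the walk only if a link closes the circle.
            if (path[-1], start) in all_links:
                found.append(list(path))
            return
        for u, v in sorted(all_links):
            if u == path[-1] and used[v[0]] < multiplicity[v[0]]:
                used[v[0]] += 1
                path.append(v)
                extend(path)
                path.pop()
                used[v[0]] -= 1

    extend([start])
    return found


# Hypothetical toy instance: a repeat IR used twice around two unique regions.
multiplicity = {"LSC": 1, "IR": 2, "SSC": 1}
links = {
    (("LSC", "+"), ("IR", "+")),
    (("IR", "+"), ("SSC", "+")),
    (("SSC", "+"), ("IR", "-")),
    (("IR", "-"), ("LSC", "+")),
}
forms = circular_scaffolds(("LSC", "+"), multiplicity, links)
for form in forms:
    print(" ".join(name + sign for name, sign in form))
```

On this toy instance the search returns two circular orders that differ only in the orientation of `SSC`, because the repeat is traversed once in each orientation. An exact solver replaces the enumeration with an objective and constraints, which is what makes larger instances tractable in practice.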

## Full-text entities

- **Genes:** MATK (megakaryocyte-associated tyrosine kinase) [NCBI Gene 4145] {aka CHK, CTK, HHYLTK, HYL, HYLTK, Lsk}
- **Diseases:** IR (MESH:C566127), DR (MESH:D051556)
- **Chemicals:** Nucleotide (MESH:D009711), ILP (-)
- **Species:** Taxus baccata (English yew, species) [taxon 25629], Sciadopitys verticillata (umbrella-pine, species) [taxon 28979], Lophocereus schottii (species) [taxon 153875], Commiphora foliacea (species) [taxon 1173001], Cyanobacterium (genus) [taxon 102234], Jasminum tortuosum (species) [taxon 1548298], Triosteum pinnatifidum (species) [taxon 134526], Juniperus scopulorum (species) [taxon 466205], Carpodetus serratus (species) [taxon 54173], Musa ornata (flowering banana, species) [taxon 160690], Welwitschia mirabilis (species) [taxon 3377], Lamprocapnos spectabilis (bleeding heart, species) [taxon 54415], Lathyrus pubescens (species) [taxon 313107], Agathis dammara (species) [taxon 60851], Eucommia ulmoides (species) [taxon 4392], Begonia pulchrifolia (species) [taxon 1691898], PX clade (clade) [taxon 569578], Pelargonium nanum (species) [taxon 59882], Podocarpus totara (species) [taxon 56901]

## Figures

18 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11288059/full.md

---
Source: https://tomesphere.com/paper/PMC11288059