# An alignment-free method for detection of missing regions for phylogenetic analysis

**Authors:** Rubyeat Islam, Atif Rahman

PMC · DOI: 10.1016/j.heliyon.2024.e32227 · Heliyon · 2024-06-04

## TL;DR

This paper introduces an alignment-free method, based on k-mer counts, for detecting genomic regions that are missing from the sequences of some species. Filtering out the corresponding k-mers can improve the accuracy of phylogenies estimated without sequence alignment.

## Contribution

An alignment-free approach, based on k-mer counts, for identifying regions that are missing from the sequences of species whose phylogeny is to be estimated.

## Key findings

- The method successfully detects a large fraction of k-mers corresponding to missing regions in sequences.
- Using the method improves the accuracy of estimated phylogenies in datasets with missing regions.
- The approach is effective on both real and simulated datasets with missing genomic regions.

## Abstract

Phylogenetic tree estimation using conventional approaches usually requires pairwise or multiple sequence alignment. However, sequence alignment faces difficulties with scalability and accuracy for long sequences such as whole genomes, for sequences with low identity, and in the presence of genomic rearrangements. To address these issues, alignment-free approaches have been proposed. While these methods have demonstrated promising results, many of them produce errors when regions are missing from the sequences of one or more species, a situation that alignment-based methods detect trivially. Here, we present an alignment-free method for detecting missing regions in the sequences of species for which a phylogeny is to be estimated. It is based on counts of k-mers and can be used to filter out k-mers belonging to regions in one species that are missing in one or more of the other species. We perform experiments with real and simulated datasets containing missing regions and find that the method can successfully detect a large fraction of such k-mers and can lead to improvements in the estimated phylogenies. Our method can be used in k-mer based alignment-free phylogeny estimation methods to filter out k-mers corresponding to missing regions.
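
To make the idea of k-mer counts concrete, the following is a minimal Python sketch. It is not the authors' method, whose detection rule is not described in this summary. It only counts overlapping k-mers per species and flags those that never occur in at least one other species. The function names `kmer_counts` and `flag_absent_kmers`, the presence/absence rule, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def flag_absent_kmers(seqs: dict, k: int) -> dict:
    """For each species, return the set of its k-mers that are absent
    from at least one other species.

    Assumption: plain presence/absence is a naive stand-in for a real
    missing-region detector, used here only to illustrate the input data.
    """
    counts = {name: kmer_counts(s, k) for name, s in seqs.items()}
    flagged = {}
    for name, own in counts.items():
        others = [c for other, c in counts.items() if other != name]
        flagged[name] = {km for km in own if any(km not in c for c in others)}
    return flagged


# Hypothetical toy input: species "b" lacks the tail region of species "a".
toy = {"a": "ACGTACGTTT", "b": "ACGTACG"}
flagged = flag_absent_kmers(toy, k=4)
```

Note that this naive rule would also flag k-mers that differ between species because of ordinary substitutions, so a practical detector has to separate those from k-mers that belong to genuinely missing regions.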

## Full-text entities

- **Diseases:** Mahalanobis Distance (MESH:C535290)
- **Species:** Gorilla (genus) [taxon 9592], Pan troglodytes (chimpanzee, species) [taxon 9598], Escherichia coli (E. coli, species) [taxon 562], Shigella (genus) [taxon 620], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11200290/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11200290/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/PMC11200290/full.md

---
Source: https://tomesphere.com/paper/PMC11200290