# Approximating edit distances between complex tandem repeats efficiently

**Authors:** Riki Kawahara, Shinichi Morishita

PMC · DOI: 10.1093/bioinformatics/btaf155 · 2025-04-09

## TL;DR

This paper introduces a fast heuristic algorithm for estimating edit distances between complex tandem repeats, which have been correlated with some brain disorders and can differ widely between individuals.

## Contribution

The contribution is hEDDC, an efficient heuristic algorithm that estimates the edit distance with duplication and contraction of units (EDDC) between pairs of complex tandem repeats, working from compressed, unit-encoded forms of the repeats.

## Key findings

- On synthetic benchmark datasets, the estimated EDDC is highly correlated with the exact EDDC, with a Pearson correlation coefficient of >0.983.
- The heuristic algorithm runs orders of magnitude faster than traditional rigorous algorithms.

## Abstract

Extended tandem repeats (TRs) have been associated with 60 or more diseases over the past 30 years. Although most TRs have single repeat units (or motifs), complex TRs with different units have recently been correlated with some brain disorders. Of note, a population-scale analysis shows that complex TRs at one locus can be divergent, and different units are often expanded between individuals. To understand the evolution of high TR diversity, it is informative to visualize a phylogenetic tree. To do this, we need to measure the edit distance between pairs of complex TRs by considering duplication and contraction of units created by replication slippage. However, traditional rigorous algorithms for this purpose are computationally expensive.
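To illustrate why a distance that accounts for unit duplication and contraction is useful, the following minimal sketch computes the plain Levenshtein distance between two pure repeats. The sequences are made-up examples, not data from the paper, and the `levenshtein` helper is an ordinary textbook implementation rather than anything taken from hEDDC.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance with unit costs (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


# Illustrative (made-up) repeats: three versus six copies of the unit CAG.
short_tr = "CAG" * 3
long_tr = "CAG" * 6

# Plain Levenshtein charges one edit per inserted base.
print(levenshtein(short_tr, long_tr))  # 9
```

Character-level edits count nine operations here, although the two repeats differ by only three copies of one unit. A distance that treats duplication and contraction of a whole unit as a single event can reflect that directly.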

We here propose an efficient heuristic algorithm to estimate the edit distance with duplication and contraction of units (EDDC, for short). We select a set of frequent units that occur in given complex TRs, encode each unit as a single symbol, compress a TR into an optimal series of unit symbols that partially matches the original TR with the minimum Levenshtein distance, and estimate the EDDC between a pair of complex TRs from their compressed forms. Using substantial synthetic benchmark datasets, we demonstrate that the estimated EDDC is highly correlated with the accurate EDDC, with a Pearson correlation coefficient of >0.983, while the heuristic algorithm achieves orders of magnitude performance speedup.
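The encoding and compression steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the hEDDC implementation: the unit set is given by hand rather than selected by frequency, the function names `levenshtein` and `compress` are hypothetical, and the dynamic program is a naive one that favors readability over speed. The final step, estimating the EDDC from the compressed forms, is not reproduced here.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance with unit costs (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def compress(tr: str, units: Dict[str, str]) -> Tuple[int, str]:
    """Return (cost, symbols) for a unit series closest to `tr`.

    `units` maps a one-character symbol to its unit string (assumed input).
    `cost` is the Levenshtein distance between `tr` and the concatenation
    of the units named by `symbols`.
    """
    # best[i] = (cost, symbols) for the prefix tr[:i]
    best: List[Tuple[int, str]] = [(0, "")]
    for i in range(1, len(tr) + 1):
        # Option 1: leave base tr[i-1] unmatched (delete it).
        cand = (best[i - 1][0] + 1, best[i - 1][1])
        # Option 2: align one unit occurrence to the segment tr[j:i].
        for j in range(i):
            for sym, unit in units.items():
                cost = best[j][0] + levenshtein(tr[j:i], unit)
                if cost < cand[0]:
                    cand = (cost, best[j][1] + sym)
        best.append(cand)
    return best[-1]


if __name__ == "__main__":
    # Hypothetical unit set and repeat, chosen only for illustration.
    units = {"A": "CAG", "B": "CCG"}
    print(compress("CAGCAGCAGCCGCCG", units))  # (0, 'AAABB')
```

Each unit occurrence is aligned to a contiguous segment of the repeat, so a segment that differs slightly from a unit still maps to that unit's symbol at a small cost. The nested loops make this sketch slow on long repeats; it is meant only to make the idea of compressing a TR into unit symbols concrete.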

The software program hEDDC, which implements the proposed algorithm, is available at https://github.com/Ricky-pon/hEDDC (DOI: 10.5281/zenodo.14732958).

## Full-text entities

- **Genes:** F2R (coagulation factor II thrombin receptor) [NCBI Gene 2149] {aka CF2R, HTR, PAR-1, PAR1, TR}
- **Diseases:** brain disorders (MESH:D001927)

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12014093/full.md

---
Source: https://tomesphere.com/paper/PMC12014093