# Finding low-complexity DNA sequences with longdust

**Authors:** Heng Li, Brian Li

PMC · DOI: 10.1093/bioinformatics/btag112 · 2026-03-13

## TL;DR

Longdust is a new algorithm that efficiently identifies low-complexity DNA sequences, such as satellite and tandem repeats, using a statistical model of k-mer counts.

## Contribution

Longdust introduces a novel, efficient method for identifying long low-complexity DNA sequences with a statistically defined complexity threshold.

## Key findings

- Longdust efficiently identifies long low-complexity sequences like centromeric satellites.
- The algorithm uses a statistical model of k-mer count distribution to define string complexity.
- Longdust performs well on real data and is highly consistent with existing methods.

## Abstract

Low-complexity (LC) DNA sequences are compositionally repetitive sequences that are often associated with spurious homologous matches and variant calling artifacts. While algorithms for identifying LC sequences exist, they either lack a concise mathematical definition of complexity or are inefficient with long or variable context windows.

Longdust is a new algorithm that efficiently identifies long LC sequences, including centromeric satellites and tandem repeats with moderately long motifs. It defines string complexity by statistically modeling the k-mer count distribution, using three parameters: the k-mer length, the context window size, and a threshold on complexity. Longdust exhibits high performance on real data and high consistency with existing methods.
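
To make the three parameters concrete, here is a minimal Python sketch of window-based low-complexity detection from k-mer counts. It is not longdust's statistical model, which this summary does not spell out. As a stand-in it uses a simple DUST-style score (the number of repeated k-mer pairs divided by the number of k-mers minus one), and the function names `repeat_score`, `merge_intervals` and `low_complexity_windows`, along with the default parameter values, are assumptions chosen for illustration.

```python
from collections import Counter


def repeat_score(seq, k):
    """DUST-style score: repeated k-mer pairs / (number of k-mers - 1).

    Illustrative stand-in only; longdust's own score is defined differently.
    """
    n = len(seq) - k + 1  # number of k-mers in seq
    if n < 2:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return pairs / (n - 1)


def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def low_complexity_windows(seq, k=3, window=16, threshold=2.0):
    """Return merged [start, end) intervals whose window score exceeds threshold.

    k, window and threshold mirror the three parameters named in the abstract;
    the default values here are arbitrary.
    """
    hits = []
    last_start = max(len(seq) - window + 1, 1)
    for start in range(last_start):
        w = seq[start:start + window]
        if repeat_score(w, k) > threshold:
            hits.append((start, start + len(w)))
    return merge_intervals(hits)
```

This naive scan recounts k-mers for every window, so it costs roughly the sequence length times the window size. That is the kind of inefficiency with long context windows that the abstract says longdust is designed to avoid.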

https://github.com/lh3/longdust

## Full-text entities

- **Genes:** TERF1 (telomeric repeat binding factor 1) [NCBI Gene 7013] {aka PIN2, TRBF1, TRF, TRF1, hTRF1-AS, t-TRF1}
- **Diseases:** LC (MESH:D009800)
- **Chemicals:** T (MESH:D014316), GC (MESH:C057580)
- **Species:** Gorilla (genus) [taxon 9592], Aphelocoma woodhouseii (Woodhouse's scrub-Jay, species) [taxon 247972], Homo sapiens (human, species) [taxon 9606]
- **Mutations:** T2T
- **Cell lines:** CHM13 — Homo sapiens (Human), Hydatidiform mole, Telomerase immortalized cell line (CVCL_VU12)

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13003316/full.md

---
Source: https://tomesphere.com/paper/PMC13003316