Finding low-complexity DNA sequences with longdust

Heng Li; Brian Li

PMC · DOI:10.1093/bioinformatics/btag112·March 13, 2026

Finding low-complexity DNA sequences with longdust

Heng Li, Brian Li

PDF

Open Access

TL;DR

Longdust is a new algorithm that efficiently identifies low-complexity DNA sequences, such as satellite and tandem repeats, using a statistical model of k-mer counts.

Contribution

Longdust introduces a novel, efficient method for identifying long low-complexity DNA sequences with a statistically defined complexity threshold.

Findings

01

Longdust efficiently identifies long low-complexity sequences like centromeric satellites.

02

The algorithm uses a statistical model of k-mer count distribution to define string complexity.

03

Longdust performs well on real data and aligns consistently with existing methods.

Abstract

Low-complexity (LC) DNA sequences are compositionally repetitive sequences that are often associated with spurious homologous matches and variant calling artifacts. While algorithms for identifying LC sequences exist, they either lack concise mathematical definition of complexity or are inefficient with long or variable context windows. Longdust is a new algorithm that efficiently identifies long LC sequences including centromeric satellite and tandem repeats with moderately long motifs. It defines string complexity by statistically modeling the k-mer count distribution with the parameters: the k-mer length, the context window size and a threshold on complexity. Longdust exhibits high performance on real data and high consistency with existing methods. https://github.com/lh3/longdust

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes1

TERF1

Proteins1

Species3

Gorilla(genus)Aphelocoma woodhouseii(Woodhouse's scrub-Jay · species)Homo sapiens(human · species)

Cell lines1

CHM13— Homo sapiens (Human) · Hydatidiform mole · Telomerase immortalized cell line

Chemicals2

T GC

Diseases1

LC

Mutations1

T2T

Figures3

Click any figure to enlarge with its caption.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsFractal and DNA sequence analysis · DNA and Biological Computing · Genome Rearrangement Algorithms