Biological Sequence Clustering: A Survey
Simeng Zhang, Xinying Liu, Jun Lou, Mudi Jiang, Quan Zou, Zengyou He

TL;DR
This survey comprehensively reviews biological sequence clustering methods, discussing their strategies, paradigms, objectives, and challenges to guide future research in large-scale bioinformatics analysis.
Contribution
It provides a detailed overview of existing algorithms, categorizing them by similarity modeling, clustering paradigms, and objectives, highlighting trade-offs and future directions.
Findings
Summarizes main strategies for modeling sequence similarity.
Classifies clustering paradigms and discusses their trade-offs.
Identifies current limitations and future challenges.
Abstract
The rapid development of high-throughput sequencing technologies has led to an explosive increase in biological sequence data, making sequence clustering a fundamental task in large-scale bioinformatics analyses. Unlike traditional clustering problems, biological sequence clustering faces unique challenges due to the lack of direct similarity measures, strict biological constraints, and demanding requirements for both scalability and accuracy. Over the past decades, a wide variety of methods have been developed, differing in how they model sequence similarity, construct clusters, and prioritize optimization objectives. In this review, we provide a comprehensive methodological overview of biological sequence clustering algorithms. We begin by summarizing the main strategies for modeling sequence similarity, which can be divided into three stages: sequence encoding, feature generation,…
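As a rough illustration of the first two stages named in the abstract (sequence encoding and feature generation), the sketch below turns each sequence into a k-mer count profile and then groups sequences greedily by cosine similarity. This is not a method taken from the survey: the function names (`kmer_counts`, `cosine`, `greedy_cluster`), the choice of k-mer profiles, the cosine measure, the greedy assignment rule, and the 0.8 threshold are all assumptions made for illustration, and the stages that the truncated abstract does not name are not represented.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Encode a sequence as counts of its overlapping k-mers (assumed feature choice)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two k-mer count profiles; 0.0 if either is empty."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(seqs, k=3, threshold=0.8):
    """Assign each sequence to the first cluster representative it resembles,
    or start a new cluster with the sequence as its representative."""
    rep_profiles = []  # k-mer profile of each cluster's representative
    clusters = []      # lists of input indices, one list per cluster
    for i, seq in enumerate(seqs):
        profile = kmer_counts(seq, k)
        for c, rep in enumerate(rep_profiles):
            if cosine(profile, rep) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No representative was similar enough: open a new cluster.
            rep_profiles.append(profile)
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    example = ["ACGTACGTAC", "ACGTACGTAC", "TTTTTTTTTT"]
    print(greedy_cluster(example))
```

The greedy pass compares each sequence only against cluster representatives rather than against every other sequence, which is one simple way to trade accuracy for scalability; the result depends on input order and on the chosen threshold.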
Taxonomy
Topics: Bioinformatics and Genomic Networks · Gene expression and cancer classification · Genomics and Phylogenetic Studies
