A clustering tool for nucleotide sequences using Laplacian Eigenmaps and Gaussian Mixture Models
Marine Bruneau (LMB), Thierry Mottet, Serge Moulin, Maël Kerbiriou (LMB), Franz Chouly (LMB), Stéphane Chretien (NPL), Christophe Guyeux

TL;DR
This paper introduces a novel clustering method for nucleotide sequences combining Laplacian Eigenmaps and Gaussian Mixture Models, validated on mitochondrial DNA sequences and shown to produce phylogenetically consistent clusters.
Contribution
The paper presents a new clustering approach for nucleotide sequences that integrates Laplacian Eigenmaps with Gaussian Mixture Models, along with a publicly available Python implementation.
Findings
Clusters align with phylogenetic trees
Method is consistent with NCBI taxonomy
Validated on mitochondrial DNA sequences
Abstract
We propose a new procedure for clustering nucleotide sequences based on the "Laplacian Eigenmaps" and Gaussian Mixture modelling. This proposal is then applied to a set of 100 DNA sequences from the mitochondrially encoded NADH dehydrogenase 3 (ND3) gene of a collection of Platyhelminthes and Nematoda species. The resulting clusters are then shown to be consistent with the gene phylogenetic tree computed using a maximum likelihood approach. This comparison shows in particular that the clustering produced by the methodology combining Laplacian Eigenmaps with Gaussian Mixture models is coherent with the phylogeny as well as with the NCBI taxonomy. We also developed a Python package for this procedure which is available online.
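The two-stage idea in the abstract (embed the sequences with Laplacian Eigenmaps, then fit a Gaussian Mixture Model in the embedded space) can be illustrated with a minimal sketch. This is not the authors' package or their exact pipeline. It assumes scikit-learn's `SpectralEmbedding` and `GaussianMixture` as stand-ins, and the 2-mer frequency profiles, Euclidean distance, median-bandwidth Gaussian affinity, fixed number of clusters, and synthetic sequences are all illustrative assumptions rather than choices taken from the paper. The helper names `kmer_profile`, `cluster_sequences`, and `random_sequence` are hypothetical.

```python
import itertools
import random

import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.mixture import GaussianMixture


def kmer_profile(seq, k=2):
    """Return the normalised k-mer frequency vector of a DNA sequence."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        counts[index[seq[i:i + k]]] += 1
    return counts / counts.sum()


def cluster_sequences(seqs, n_clusters, n_dims=2, seed=0):
    """Embed sequences with Laplacian Eigenmaps, then cluster with a GMM."""
    profiles = np.array([kmer_profile(s) for s in seqs])
    # Pairwise Euclidean distances between k-mer profiles (assumed metric).
    diff = profiles[:, None, :] - profiles[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Gaussian (heat-kernel) affinity; bandwidth from the median heuristic.
    sigma = np.median(dist[np.triu_indices_from(dist, k=1)])
    affinity = np.exp(-dist ** 2 / (2 * sigma ** 2))
    # Laplacian Eigenmaps: coordinates from the graph Laplacian eigenvectors.
    embedding = SpectralEmbedding(
        n_components=n_dims, affinity="precomputed", random_state=seed
    ).fit_transform(affinity)
    # Gaussian Mixture Model fitted in the embedded space.
    gmm = GaussianMixture(n_components=n_clusters, n_init=5, random_state=seed)
    return gmm.fit_predict(embedding)


def random_sequence(rng, weights, length=300):
    """Draw a synthetic sequence with the given A, C, G, T base weights."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))


rng = random.Random(0)
# Three synthetic groups with different base compositions (toy data only).
compositions = [(4, 1, 1, 4), (1, 4, 4, 1), (4, 1, 4, 1)]
seqs = [random_sequence(rng, w) for w in compositions for _ in range(8)]
labels = cluster_sequences(seqs, n_clusters=3)
print(labels)
```

On real data the number of mixture components is usually not known in advance. One common option, again an assumption here and not something the abstract specifies, is to fit several values and compare them with `GaussianMixture.bic`.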
Taxonomy
Topics: Genomics and Phylogenetic Studies · Genetic diversity and population structure · Bayesian Methods and Mixture Models
