CGRclust: Chaos Game Representation for Twin Contrastive Clustering of Unlabelled DNA Sequences
Fatemeh Alipour, Kathleen A. Hill, Lila Kari

TL;DR
CGRclust is an unsupervised deep learning method that clusters diverse unlabelled DNA sequences using Chaos Game Representations and twin contrastive clustering, without requiring sequence alignment or taxonomic labels.
Contribution
To the best of the authors' knowledge, CGRclust is the first method to apply unsupervised twin contrastive learning to CGR images for clustering DNA sequences, and it outperforms the recent clustering methods it was compared against.
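For readers unfamiliar with contrastive objectives, the following is a minimal sketch of a generic NT-Xent (normalized temperature-scaled cross-entropy) loss over two augmented views of the same samples. It is an illustration only: this summary does not specify CGRclust's actual loss, and the function name `nt_xent`, the temperature value, and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def nt_xent(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """Generic NT-Xent loss (illustrative, not the paper's exact objective).

    z1, z2: arrays of shape (n, d); row i of z1 and row i of z2 are
    embeddings of two augmented views of the same sample (a positive pair).
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0).astype(float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                     # a sample is never compared with itself
    # Index of the positive partner for each of the 2n rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

The loss is low when the two views of each sample embed close together and far from all other samples, which is the property a contrastive clustering method relies on.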
Findings
Achieved over 81.70% accuracy on mitochondrial DNA datasets.
Consistently outperformed recent clustering methods across diverse datasets.
Demonstrated robustness and scalability across various sequence lengths and taxonomic levels.
Abstract
This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared…
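As an illustration of the two-dimensional CGR images the abstract refers to, the sketch below computes a frequency CGR: a 2^k x 2^k grid in which each cell counts the occurrences of one k-mer. The corner assignment for A, C, G, T, the default resolution `k=6`, and the function name `fcgr` are assumptions chosen for the example, not details taken from the paper.

```python
import numpy as np

# Corner assignment is one common convention, assumed here for illustration.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(seq: str, k: int = 6) -> np.ndarray:
    """Return a (2^k, 2^k) matrix of k-mer counts laid out as a CGR image."""
    size = 2 ** k
    grid = np.zeros((size, size), dtype=np.int64)
    x, y = 0.5, 0.5   # start at the centre of the unit square
    run = 0           # number of consecutive unambiguous bases seen so far
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            run = 0   # ambiguous symbol (e.g. N): k-mers spanning it are skipped
            continue
        # Chaos game step: move halfway toward the corner of the current base.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        run += 1
        if run >= k:
            # The cell is determined by the last k bases, i.e. by one k-mer.
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            grid[row, col] += 1
    return grid
```

For a sequence with no ambiguous symbols, the grid entries sum to `len(seq) - k + 1`, the number of k-mers. Dividing the grid by its maximum or its sum yields a grayscale image that could be passed to a CNN.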
Taxonomy
Topics: Fractal and DNA sequence analysis · Machine Learning in Bioinformatics · Molecular spectroscopy and chirality
Methods: Contrastive Learning
