Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing
Adele Chinda, Richmond Azumah, Hemanth Demakethepalli Venkateswara

TL;DR
This paper introduces an unsupervised, reference-free deep learning framework based on VQ-VAE for detecting viral variants in wastewater sequencing data, targeting the high sequencing noise, low viral coverage, and fragmented reads typical of such samples.
Contribution
It presents a VQ-VAE-based method that combines masked reconstruction pretraining with contrastive learning to learn robust, discrete representations of genomic patterns without reference genomes or variant labels.
Findings
Achieved 99.52% token-level accuracy on SARS-CoV-2 wastewater data.
Enhanced variant clustering with contrastive fine-tuning, improving Silhouette scores by up to 42%.
Demonstrated scalable, interpretable genomic surveillance applicable to public health.
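The Silhouette score cited in the findings measures how well separated a clustering is. The sketch below is a minimal pure-Python version of the standard mean Silhouette coefficient with Euclidean distance, plus a helper for a relative percentage gain. The function names are hypothetical. The summary does not say whether the reported 42% is a relative or absolute improvement, so the relative form is only an assumption, and nothing here reproduces the paper's evaluation code.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

def relative_gain(before, after):
    """Percentage change of `after` relative to `before` (assumed definition)."""
    return 100.0 * (after - before) / abs(before)

# Toy embeddings: two tight, well-separated clusters (illustrative values only).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
```

Scores near 1 indicate compact, well-separated clusters, and scores near 0 indicate overlapping ones. Comparing the score before and after contrastive fine-tuning on the same embeddings' cluster labels is one way such an improvement could be quantified.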
Abstract
Wastewater-based genomic surveillance has emerged as a powerful tool for population-level viral monitoring, offering comprehensive insights into circulating viral variants across entire communities. However, this approach faces significant computational challenges stemming from high sequencing noise, low viral coverage, fragmented reads, and the complete absence of labeled variant annotations. Traditional reference-based variant calling pipelines struggle with novel mutations and require extensive computational resources. We present a comprehensive framework for unsupervised viral variant detection using Vector-Quantized Variational Autoencoders (VQ-VAE) that learns discrete codebooks of genomic patterns from k-mer tokenized sequences without requiring reference genomes or variant labels. Our approach extends the base VQ-VAE architecture with masked reconstruction pretraining for…
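As a rough illustration of the two building blocks the abstract names, the following sketch tokenizes a read into overlapping k-mers and assigns a vector to its nearest codebook entry, which is the quantization step at the core of any VQ-VAE. The value of k, the stride of 1, the handling of ambiguous bases, the toy codebook, and all function names are assumptions made for this example; the paper's actual settings are not given in this summary.

```python
from itertools import product

def build_vocab(k):
    """Map every possible DNA k-mer over ACGT to an integer id."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}

def kmer_tokenize(read, k, vocab, unk_id=-1):
    """Slide a width-k window (stride 1) over the read and look up each k-mer.

    K-mers containing bases outside ACGT (e.g. N) map to `unk_id`.
    """
    read = read.upper()
    return [vocab.get(read[i:i + k], unk_id) for i in range(len(read) - k + 1)]

def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sq_dist(vector, codebook[j]))

# Hypothetical settings: k = 3 and a 3-entry, 2-dimensional codebook.
vocab = build_vocab(3)
tokens = kmer_tokenize("ACGTNAC", 3, vocab)
code = quantize([0.9, 0.1], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In a trained model the vectors passed to `quantize` would come from an encoder applied to the token sequence, and the codebook would be learned jointly with it. Here both are fixed by hand to keep the example self-contained.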
Taxonomy
Topics: SARS-CoV-2 detection and testing · Domain Adaptation and Few-Shot Learning · Single-cell and spatial transcriptomics
