Blind Biological Sequence Denoising with Self-Supervised Set Learning
Nathan Ng, Ji Won Park, Jae Hyeon Lee, Ryan Lewis Kelly, Stephen Ra, Kyunghyun Cho

TL;DR
This paper introduces Self-Supervised Set Learning (SSSL), a novel method for denoising biological sequences from noisy subreads without needing clean labels, improving accuracy especially on small and difficult reads.
Contribution
The paper presents a new self-supervised approach that effectively denoises sets of biological sequences by embedding and averaging subreads, outperforming existing alignment-based methods.
Findings
Reduces errors by 17% on small reads with ≤6 subreads
Reduces errors by 8% on larger reads with >6 subreads
Significantly improves denoising on challenging small reads in real datasets
Abstract
Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the "average" of the subreads and can be decoded into a prediction of the clean sequence.…
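The "midpoint" idea in the abstract can be illustrated with a small toy sketch. This is not the SSSL model: the paper learns its embedding space and decodes a new sequence from the set embedding, whereas the sketch below assumes a fixed k-mer count vector as a stand-in encoder and simply picks the existing subread closest to the midpoint. All function names here (`kmer_embedding`, `latent_midpoint_read`, `sequence_midpoint_read`) are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import sqrt


def kmer_embedding(seq, k=2):
    """Toy stand-in encoder: sparse k-mer count vector (NOT the paper's learned encoder)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mean_embedding(embeddings):
    """Average a list of sparse count vectors: the midpoint in the toy latent space."""
    total = Counter()
    for emb in embeddings:
        total.update(emb)  # Counter.update adds counts
    n = len(embeddings)
    return {key: count / n for key, count in total.items()}


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(key, 0) - b.get(key, 0)) ** 2 for key in keys))


def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1]


def latent_midpoint_read(subreads, k=2):
    """Return the subread whose toy embedding lies closest to the mean embedding."""
    centre = mean_embedding([kmer_embedding(s, k) for s in subreads])
    return min(subreads, key=lambda s: distance(kmer_embedding(s, k), centre))


def sequence_midpoint_read(subreads):
    """Return the subread with the smallest total edit distance to all subreads."""
    return min(subreads, key=lambda s: sum(edit_distance(s, t) for t in subreads))


# Made-up subreads of one short source sequence, two of them with a single error.
subreads = ["ACGTACGT", "ACGTACGT", "ACGTTCGT", "ACGACGT"]
print(latent_midpoint_read(subreads))
print(sequence_midpoint_read(subreads))
```

Both selection rules can only return a sequence that already appears among the subreads, so they cannot correct an error that every subread shares. Decoding from a learned set embedding, as the abstract describes, is what removes that restriction.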
Taxonomy
Topics: Gene expression and cancer classification · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
