Self-Distillation Improves DNA Sequence Inference
Tong Yu, Lei Cheng, Ruslan Khalitov, Erland Brandser Olsson, Zhirong Yang

TL;DR
This paper introduces a novel self-distillation neural network model that improves DNA sequence inference by combining masked learning and contrastive learning, effectively leveraging both individual sequence context and population distribution.
Contribution
The paper presents a new deep neural network that combines collaborative student-teacher learning with contrastive learning. It is designed specifically for DNA sequences and improves inference accuracy over existing self-supervised pretraining (SSP) approaches.
Findings
Significant performance improvements across 20 downstream tasks.
Effective integration of contextual and distributional information.
Pretraining on human genome enhances downstream inference.
Abstract
Self-supervised pretraining (SSP) has been recognized as a method to enhance prediction accuracy in various downstream tasks. However, its efficacy for DNA sequences remains somewhat constrained. This limitation stems primarily from the fact that most existing SSP approaches in genomics focus on masked language modeling of individual sequences, neglecting the crucial aspect of encoding statistics across multiple sequences. To overcome this challenge, we introduce an innovative deep neural network model, which incorporates collaborative learning between a 'student' and a 'teacher' subnetwork. In this model, the student subnetwork employs masked learning on nucleotides and progressively adapts its parameters to the teacher subnetwork through an exponential moving average approach. Concurrently, both subnetworks engage in contrastive learning, deriving insights from two augmented…
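The two mechanisms named in the abstract, masked learning on nucleotides and an exponential moving average (EMA) link between the student and teacher, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common self-distillation convention in which the teacher's weights track an EMA of the student's weights. The function names `ema_update` and `mask_nucleotides`, the decay value, the 15% mask rate, and the use of `N` as a mask symbol are all hypothetical choices made for illustration.

```python
import random


def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    Assumption: teacher <- decay * teacher + (1 - decay) * student,
    the usual convention in self-distillation setups.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def mask_nucleotides(seq, mask_rate=0.15, mask_token="N", rng=None):
    """Replace a random subset of nucleotides with a mask token.

    Returns the masked sequence and the list of masked positions,
    which a model would be trained to reconstruct.
    The mask rate and mask token are illustrative, not from the paper.
    """
    rng = rng or random.Random(0)
    masked = []
    positions = []
    for i, base in enumerate(seq):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            positions.append(i)
        else:
            masked.append(base)
    return "".join(masked), positions


if __name__ == "__main__":
    # Toy weight vectors standing in for network parameters.
    teacher = [0.0, 1.0]
    student = [1.0, 1.0]
    print(ema_update(teacher, student, decay=0.9))

    masked_seq, masked_pos = mask_nucleotides("ACGTACGTACGT", mask_rate=0.25)
    print(masked_seq, masked_pos)
```

With a decay close to 1, the teacher changes slowly and acts as a smoothed version of the student, which is the usual motivation for EMA-based teachers.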
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics
