TL;DR
This paper introduces GV-Rep, a large-scale dataset of 7 million genetic variants with detailed annotations, aimed at improving deep learning models for genetic variant representation and classification.
Contribution
The paper presents a comprehensive dataset for GV representation learning, including detailed annotations and real-world data, to facilitate the development of genomic foundation models.
Findings
Significant gap identified between current GFMs and accurate GV representation
Dataset enables analysis of GV structure and properties
Pre-trained GFMs show limited performance on GV tasks
Abstract
Genetic variants (GVs) are defined as differences in the DNA sequences among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next-generation sequencing cost has led to an exponential increase in patient-level GV data. This growth poses a challenge for clinicians, who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. To address the interpretation of GVs, genomic foundation models (GFMs) have emerged. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This raises the question: How effectively do deep learning methods classify unknown GVs and align them with clinically verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an…
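To make the idea of aligning an unknown GV with clinically verified GVs concrete, here is a minimal sketch that assumes each variant has already been mapped to an embedding vector by some model. It is not the paper's method: the function names (`cosine`, `align_to_verified`), the 3-dimensional vectors, and the labels are all hypothetical and chosen only for illustration. The unknown variant is assigned the label of its most similar verified neighbour under cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_to_verified(query, verified):
    """Return (label, similarity) of the verified embedding closest to `query`.

    `verified` is a list of (embedding, label) pairs. Hypothetical helper,
    not an API from the paper or from any library.
    """
    best_label, best_sim = None, float("-inf")
    for emb, label in verified:
        sim = cosine(query, emb)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Toy, made-up embeddings standing in for model outputs.
verified = [
    ([1.0, 0.0, 0.0], "pathogenic"),
    ([0.0, 1.0, 0.0], "benign"),
]
label, sim = align_to_verified([0.9, 0.1, 0.0], verified)
print(label, round(sim, 3))
```

In this toy setting, the quality of the alignment depends entirely on how well the embedding space separates variant classes, which is the property a representation benchmark would be probing.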
Taxonomy
Methods: ALIGN
