GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

Zehui Li; Vallijah Subasri; Guy-Bart Stan; Yiren Zhao; Bo Wang

arXiv:2407.16940 · cs.LG · December 6, 2024

TL;DR

This paper introduces GV-Rep, a large-scale dataset of 7 million genetic variants with detailed annotations, aimed at improving deep learning models for genetic variant representation and classification.

Contribution

The paper presents a comprehensive dataset for genetic variant (GV) representation learning, including detailed annotations and real-world data, to facilitate the development of genomic foundation models (GFMs).

Findings

01 Significant gap identified between current GFMs and accurate GV representation

02 Dataset enables analysis of GV structure and properties

03 Pre-trained GFMs show limited performance on GV tasks

Abstract

Genetic variants (GVs) are defined as differences in the DNA sequences among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next-generation sequencing cost has led to an exponential increase in patient-level GV data. This growth poses a challenge for clinicians who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. To address the interpretation of GVs, genomic foundation models (GFMs) have emerged. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This poses the question: How effectively do deep learning methods classify unknown GVs and align them with clinically verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an…

Code & Models

Repositories

bowang-lab/genomic-fm
pytorch · Official

Videos

GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning · slideslive

Taxonomy

Methods: ALIGN