Learning Genomic Structure from $k$-mers
Filip Thor, Carl Nettelblad

TL;DR
This paper introduces a contrastive learning method for genomic data that creates embeddings capturing genome structure, enabling improved read mapping, structural variation detection, and metagenomic classification without full genome assembly.
Contribution
The authors develop a novel contrastive learning framework for $k$-mer sequences that preserves genomic structure in embeddings, applicable to various genomic analysis tasks.
Findings
Embeddings reflect genomic structure, with the sequential order of genomic regions preserved as trajectories through the embedding space.
Performance comparable to BWA-aln in simulated ancient DNA (aDNA) read mapping.
Scalable approach suitable for large genomes and metagenomic data.
Abstract
Sequencing a genome to determine an individual's DNA produces an enormous number of short nucleotide subsequences known as reads, which must be reassembled to reconstruct the full genome. We present a method for analyzing this type of data using contrastive learning, in which an encoder model is trained to produce embeddings that cluster together sequences from the same genomic region. The sequential nature of genomic regions is preserved in the form of trajectories through this embedding space. Trained solely to reflect the structure of the genome, the resulting model provides a general representation of $k$-mer sequences, suitable for a range of downstream tasks involving read data. We apply our framework to learn the structure of the genome, and demonstrate its use in simulated ancient DNA (aDNA) read mapping and identification of structural variations. Furthermore, we…
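The contrastive setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (kmers, sample_positive_pair, info_nce), the choice of InfoNCE as the loss, and the temperature value are all assumptions made for illustration, since the abstract does not state which objective or sampling scheme the paper uses. The sketch shows how a read splits into $k$-mers, how two reads from nearby genome positions might form a positive pair, and how a batch-wise contrastive loss rewards embeddings in which each anchor is closest to its own positive.

```python
import numpy as np


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sample_positive_pair(genome, read_len, max_offset, rng):
    """Hypothetical sampler: draw two reads whose start positions are at
    most max_offset apart, so both come from the same genomic region."""
    start = int(rng.integers(0, len(genome) - read_len - max_offset + 1))
    shift = int(rng.integers(0, max_offset + 1))
    first = genome[start:start + read_len]
    second = genome[start + shift:start + shift + read_len]
    return first, second


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss (an assumed objective, not confirmed by the source).
    Row i of anchors should match row i of positives; every other row
    in the batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # cosine similarities, scaled
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    genome = "".join(rng.choice(list("ACGT"), size=200))
    read_a, read_b = sample_positive_pair(genome, read_len=30, max_offset=10, rng=rng)
    print(kmers(read_a, 5)[:3], kmers(read_b, 5)[:3])

    # Stand-in embeddings: correctly paired rows versus mismatched rows.
    emb = np.eye(4)
    print(info_nce(emb, emb), info_nce(emb, np.roll(emb, 1, axis=0)))
```

In a real pipeline the stand-in embeddings would be replaced by the output of a trainable encoder applied to the sampled reads. The loss is lower when each anchor is paired with its own positive than when the pairs are mismatched, which is the property that pulls reads from the same region together.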
Taxonomy
Topics: Fractal and DNA sequence analysis · Genomics and Phylogenetic Studies · Genome Rearrangement Algorithms
Methods: Contrastive Learning
