dna2vec: Consistent vector representations of variable-length k-mers

Patrick Ng

arXiv:1701.06279 · q-bio.QM · January 24, 2017 · 147 citations

TL;DR

This paper introduces dna2vec, a method for generating consistent, distributed vector representations of variable-length DNA k-mers using a word2vec-inspired model, addressing limitations of one-hot encoding.

Contribution

The paper presents a method for embedding variable-length DNA k-mers into a single continuous vector space, so that similarity between k-mers can be measured directly on their vectors.

Findings

01. Summing dna2vec vectors approximates nucleotide concatenation.

02. Cosine similarity between dna2vec vectors correlates with Needleman-Wunsch similarity scores.

03. Distributed vectors address the curse of dimensionality that affects one-hot k-mer encodings.
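The two quantities compared in the second finding can be sketched as follows. This is a minimal illustration, not the paper's code: the scoring scheme (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for the example, and the vectors passed to `cosine_similarity` would in practice come from a trained embedding.

```python
import math


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of strings a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)
```

To probe the reported correlation, one would compute both values over many k-mer pairs and compare the two lists, for example with a rank correlation.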

Abstract

One of the ubiquitous representations of a long DNA sequence is dividing it into shorter k-mer components. Unfortunately, the straightforward encoding of a k-mer as a one-hot vector is vulnerable to the curse of dimensionality. Worse yet, every pair of one-hot vectors is equidistant. This is particularly problematic when applying the latest machine learning algorithms to problems in biological sequence analysis. In this paper, we propose a novel method to train distributed representations of variable-length k-mers. Our method is based on the popular word embedding model word2vec, which trains a shallow two-layer neural network. Our experiments provide evidence that summing dna2vec vectors is akin to concatenating nucleotides. We also demonstrate that there is a correlation between the Needleman-Wunsch similarity score and the cosine similarity of dna2vec…
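The two problems the abstract attributes to one-hot encoding can be checked with a small sketch. The helper names below are hypothetical and the choice of k = 3 is arbitrary; the point is only that the vector length grows as 4^k and that every pair of distinct k-mers ends up the same distance apart, however similar their sequences are.

```python
import math
from itertools import product


def one_hot_kmers(k, alphabet="ACGT"):
    """Map every k-mer over the alphabet to its one-hot vector."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    size = len(kmers)  # 4**k for the DNA alphabet
    return {
        kmer: [1 if i == idx else 0 for i in range(size)]
        for idx, kmer in enumerate(kmers)
    }


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


vectors = one_hot_kmers(3)

# AAA and AAC differ by one nucleotide; AAA and TTT differ in all three.
d_near = euclidean(vectors["AAA"], vectors["AAC"])
d_far = euclidean(vectors["AAA"], vectors["TTT"])

print(len(vectors["AAA"]))  # vector length for k = 3
print(d_near, d_far)        # both distances are equal
```

Because the encoding carries no notion of sequence similarity, a learned distributed representation is needed before distances between k-mers become informative.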

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research