Vector Embeddings by Sequence Similarity and Context for Improved Compression, Similarity Search, Clustering, Organization, and Manipulation of cDNA Libraries
Daniel H. Um, David A. Knowles, Gail E. Kaiser

TL;DR
This paper introduces a vector embedding approach for cDNA sequences that enhances clustering, compression, and similarity search efficiency by transforming raw sequences into context-aware numerical representations.
Contribution
It presents a novel method for encoding gene sequences into vector embeddings based on sequence similarity and context, improving clustering, compression, and search tasks.
Findings
Enhanced clustering of gene sequences using vector embeddings.
Improved compression performance for cDNA libraries.
Faster similarity searches through Euclidean space proximity algorithms.
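To make the idea of searching by Euclidean proximity concrete, here is a minimal sketch in Python. It does not reproduce the paper's learned embeddings. As a stand-in, it assumes a fixed 64-dimensional vector of normalized 3-mer frequencies, and the names `embed`, `nearest`, `K`, and the toy `library` are all hypothetical.

```python
from itertools import product
from math import dist

# Assumption: a fixed 3-mer frequency vector stands in for the paper's
# learned embedding, which is not specified in this summary.
K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def embed(seq):
    """Map a sequence to a 64-dimensional vector of normalized 3-mer frequencies."""
    counts = [0.0] * len(KMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        j = INDEX.get(seq[i:i + K].upper())
        if j is not None:  # skip windows containing N or other symbols
            counts[j] += 1
            total += 1
    return [c / total for c in counts] if total else counts


def nearest(query, library):
    """Return the name of the library sequence closest to the query in Euclidean distance."""
    qv = embed(query)
    return min(library, key=lambda name: dist(qv, embed(library[name])))


# Toy library (made-up sequences, for illustration only).
library = {
    "a": "ATGGCCATTGTAATGGGCCGC",
    "b": "TTTTTTTTTTTTTTTTTT",
    "c": "ACGTACGTACGTACGTACGT",
}
print(nearest("ATGGCCATTGTAATGGGCCGA", library))
```

A linear scan like this recomputes every library vector per query. In practice the vectors would be computed once and stored in a spatial index, which is where a speedup over string alignment could come from.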
Abstract
This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). FASTA/FASTQ files have several current limitations, such as their large file sizes, slow processing speeds for mapping and alignment, and contextual dependencies. These challenges significantly hinder investigations and tasks that involve finding similar sequences. The solution lies in transforming sequences into an alternative representation that facilitates easier clustering into similar groups compared to the raw sequences themselves. By assigning a unique vector embedding to each short sequence, it is possible to more efficiently cluster and improve upon compression performance for the string representations of cDNA libraries. Furthermore, through learning alternative coordinate vector embeddings based on the contexts of codon…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Gene expression and cancer classification · DNA and Biological Computing
