Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data
Prakash Chourasia, Sarwan Ali, Simone Ciccolella, Gianluca Della Vedova, Murray Patterson

TL;DR
Reads2Vec introduces an alignment-free method to generate fixed-length embeddings directly from raw sequencing reads, enabling efficient classification and clustering of SARS-CoV-2 data without the need for genome assembly.
Contribution
The paper presents a novel alignment-free embedding technique for raw sequencing reads that improves classification and clustering over existing methods.
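To make the idea of a fixed-length, alignment-free embedding concrete, the sketch below maps a set of raw reads to a normalized k-mer frequency vector. This is a generic k-mer spectrum and not necessarily the procedure Reads2Vec uses. The names `kmer_index` and `embed_reads`, the choice `k=3`, and the toy reads are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_index(k):
    """Map every possible k-mer over ACGT to a fixed vector position."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def embed_reads(reads, k=3):
    """Return a length-4**k frequency vector for a collection of reads.

    No alignment or assembly is performed: each read is scanned
    independently and its k-mers are tallied into one shared vector.
    """
    index = kmer_index(k)
    counts = [0] * len(index)
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            pos = index.get(read[i:i + k])
            if pos is not None:  # skip k-mers containing N or other symbols
                counts[pos] += 1
    total = sum(counts)
    # Normalize so samples with different read counts stay comparable.
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Toy samples (hypothetical reads, not real sequencing data).
sample_a = ["ACGTAC", "GTACGT", "ACGNAC"]
sample_b = ["TTTTTTTTTT"]

vec_a = embed_reads(sample_a, k=3)
vec_b = embed_reads(sample_b, k=3)
print(len(vec_a), len(vec_b))
```

Both samples yield vectors of the same length regardless of how many reads they contain or how long those reads are, which is what allows a standard classifier or clustering algorithm to consume them directly.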
Findings
Better classification accuracy on simulated data
Improved clustering properties compared to existing tools
Spike region significantly influences clustering results
Abstract
The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment and curation may become a bottleneck, creating a need for methods which can process raw sequencing reads directly. In this paper, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a…
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · COVID-19 diagnosis using AI
