Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences
Sarwan Ali, Taslim Murad

TL;DR
Murmur2Vec introduces a hashing-based embedding method for SARS-CoV-2 spike sequences, enabling fast, scalable, and accurate lineage classification suitable for large datasets.
Contribution
The paper presents a novel hashing-based embedding technique that improves efficiency and scalability over existing methods for viral sequence analysis.
Findings
Achieves up to 86.4% classification accuracy
Reduces embedding generation time by 99.81%
Outperforms baseline and state-of-the-art methods
Abstract
Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional…
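As a rough illustration of what a hashing-based sequence embedding can look like, the sketch below hashes overlapping k-mers of a sequence into a fixed number of buckets and returns the normalized bucket counts. It is not the authors' implementation: the abstract as shown here does not specify the Murmur variant, the k-mer scheme, or the output dimension. The pure-Python 32-bit MurmurHash3 routine, the function names `murmur3_32` and `hash_embed`, and the defaults `k=3` and `dim=64` are all assumptions chosen for the example.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit). Assumed variant, for illustration."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF
    h = seed & mask
    n = len(data)
    rounded = n - (n % 4)

    def mix(k: int) -> int:
        # Scramble one 32-bit block before it is folded into the state.
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        return (k * c2) & mask

    # Body: process the input four bytes at a time (little-endian).
    for i in range(0, rounded, 4):
        h ^= mix(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & mask
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[rounded:]
    if tail:
        h ^= mix(int.from_bytes(tail, "little"))

    # Finalization: force the bits to avalanche.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def hash_embed(seq: str, k: int = 3, dim: int = 64, seed: int = 0) -> list[float]:
    """Map a sequence to a dim-length vector of normalized hashed k-mer counts.

    k, dim, and seed are hypothetical defaults, not values from the paper.
    """
    vec = [0.0] * dim
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k].encode("ascii")
        vec[murmur3_32(kmer, seed) % dim] += 1.0
    total = sum(vec)
    # Normalize so sequences of different lengths are comparable.
    return [v / total for v in vec] if total else vec


if __name__ == "__main__":
    # A short, made-up amino-acid fragment used only to exercise the code.
    embedding = hash_embed("MFVFLVLLPLVSSQCVNLT")
    print(len(embedding), round(sum(embedding), 6))
```

Because the vector length is fixed by `dim` rather than by the sequence length or a k-mer vocabulary, a scheme like this needs no alignment and runs in a single pass over each sequence, which is the kind of property the abstract attributes to hashing-based embeddings. The resulting vectors could then be passed to any standard classifier.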
Taxonomy
Topics: Genomics and Phylogenetic Studies · Fractal and DNA sequence analysis · SARS-CoV-2 and COVID-19 Research
