TL;DR
This paper introduces the Burrows Wheeler Markov Distance (BWMD), a novel, fixed-length embedding-based metric that improves clustering of DNA sequences and malware, overcoming limitations of previous compression-inspired methods.
Contribution
The paper presents BWMD, a new distance measure that embeds sequences into fixed-length vectors, enhancing clustering performance and applicability across bioinformatics and cybersecurity domains.
Findings
BWMD outperforms previous compression-based distances in clustering accuracy.
BWMD effectively handles variable-length DNA sequences.
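BWMD adapts to malware classification tasks and gives significantly improved clustering on larger malware corpora, where prior compression-based methods were weak.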
Abstract
Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and introduce our new Burrows Wheeler Markov Distance (BWMD) as an alternative. The BWMD avoids the shortcomings of earlier efforts, and allows us to tackle problems in variable-length DNA sequence clustering. BWMD is also more adaptable to other domains, which we demonstrate on malware classification tasks. Unlike other compression-based distance metrics known to us, BWMD works by embedding sequences into a fixed-length feature vector. This allows us to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods.
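To make the two ideas in the abstract concrete, the sketch below computes a naive Burrows Wheeler Transform and then maps its output to a fixed-length vector. The abstract does not spell out how BWMD builds its embedding, so `transition_embedding` is a hypothetical stand-in (normalized first-order transition counts over the BWT output), not the paper's construction. The function names, the `$` sentinel, and the `ACGT` alphabet are assumptions made for illustration.

```python
import math
from typing import List


def bwt(s: str, sentinel: str = "$") -> str:
    """Naive Burrows Wheeler Transform: sort all rotations, keep the last column."""
    if sentinel in s:
        raise ValueError("sentinel must not occur in the input")
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def transition_embedding(s: str, alphabet: str = "ACGT") -> List[float]:
    """Hypothetical embedding (NOT the paper's definition): normalized counts of
    adjacent symbol pairs in the BWT output. The vector always has
    len(alphabet) ** 2 entries, whatever the input length.
    Symbols outside `alphabet` raise KeyError."""
    index = {c: i for i, c in enumerate(alphabet)}
    k = len(alphabet)
    vec = [0.0] * (k * k)
    out = bwt(s).replace("$", "")  # drop the sentinel before counting pairs
    for a, b in zip(out, out[1:]):
        vec[index[a] * k + index[b]] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def euclidean(u: List[float], v: List[float]) -> float:
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two DNA strings of different lengths land in the same 16-dimensional space.
e1 = transition_embedding("ACGTACGT")
e2 = transition_embedding("ACGTTGCAAC")
distance = euclidean(e1, e2)
```

Because every sequence maps to a vector of the same size, standard vector-space clustering and indexing tools can be applied directly, which pairwise compression-based distances do not allow.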
