De Bruijn entropy and string similarity
Steve Huntsman, Arman Rezaee

TL;DR
This paper introduces the de Bruijn entropy of an Eulerian quiver and applies the corresponding relative entropy to string similarity. The resulting measure is superior to edit distances in many respects and competitive in most others, and it is demonstrated on a realistic molecular phylogenetics problem.
Contribution
It presents a novel entropy-based method for string similarity that explicitly links the combinatorial and information-theoretical properties of words, with computational complexity tunable between linear and cubic.
Findings
Superior to edit distances in many respects and competitive in most others
Complexity tunable between linear and cubic
Effective in molecular phylogenetics applications
Abstract
We introduce the notion of de Bruijn entropy of an Eulerian quiver and show how the corresponding relative entropy can be applied to practical string similarity problems. This approach explicitly links the combinatorial and information-theoretical properties of words and its performance is superior to edit distances in many respects and competitive in most others. The computational complexity of our current implementation is parametrically tunable between linear and cubic, and we outline how an optimized linear algebra subroutine can reduce the cubic complexity to approximately linear. Numerous examples are provided, including a realistic application to molecular phylogenetics.
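The abstract does not spell out the definition of de Bruijn entropy, so the following is only a minimal sketch of the combinatorial object involved, not the authors' implementation. It builds the order-k de Bruijn multigraph of a word read cyclically (which is always Eulerian) and counts its Eulerian circuits with the BEST theorem. Using the logarithm of that count as an entropy-like quantity is an assumption made here for illustration; the paper's exact definition and normalization may differ. The function names (`de_bruijn_edges`, `eulerian_circuit_count`, `log_circuit_count`) are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial, log


def de_bruijn_edges(word, k):
    """Edge multiplicities of the order-k de Bruijn multigraph of `word`,
    read cyclically. Assumes 1 <= k <= len(word)."""
    ext = word + word[:k]  # wrap around so every cyclic (k+1)-gram appears
    return Counter((ext[i:i + k], ext[i + 1:i + k + 1]) for i in range(len(word)))


def det_exact(matrix):
    """Exact determinant by Gaussian elimination over the rationals.
    This dense elimination takes cubic time in the matrix size."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n, det = len(m), Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n):
                m[r][j] -= f * m[c][j]
    return det


def eulerian_circuit_count(edges):
    """BEST theorem: (#arborescences rooted at one vertex) * prod (outdeg(v) - 1)!.
    Parallel edges are treated as distinguishable; circuits are counted up to rotation."""
    vertices = sorted({v for e in edges for v in e})
    idx = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    lap = [[0] * n for _ in range(n)]  # Laplacian: out-degree minus adjacency
    outdeg = [0] * n
    for (u, v), mult in edges.items():
        outdeg[idx[u]] += mult
        lap[idx[u]][idx[u]] += mult
        lap[idx[u]][idx[v]] -= mult
    minor = [row[1:] for row in lap[1:]]  # delete the root's row and column
    count = det_exact(minor)              # matrix-tree theorem
    for d in outdeg:
        count *= factorial(d - 1)
    return int(count)


def log_circuit_count(word, k):
    """Illustrative entropy-like quantity (an assumption, not the paper's definition)."""
    return log(eulerian_circuit_count(de_bruijn_edges(word, k)))


print(eulerian_circuit_count(de_bruijn_edges("abab", 1)))
print(log_circuit_count("aabb", 1))
```

In this sketch the parameter `k` trades graph size against context length, and the determinant is the dense linear-algebra step with cubic cost. That is consistent with, but not confirmed as, the tunable complexity the abstract describes.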
Taxonomy
Topics: Algorithms and Data Compression · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
