De Bruijn entropy and string similarity

Steve Huntsman; Arman Rezaee

arXiv:1509.02975 · cs.DM · January 24, 2022 · 2 cites

Open Access

TL;DR

This paper introduces the de Bruijn entropy of an Eulerian quiver and applies the corresponding relative entropy to string similarity. The measure is reported to be superior to edit distances in many respects and competitive in most others, and it is demonstrated on a realistic molecular phylogenetics problem.

Contribution

It presents an entropy-based string similarity measure that explicitly links the combinatorial and information-theoretical properties of words, with computational complexity that is parametrically tunable between linear and cubic.

Findings

01 Superior to edit distances in many respects and competitive in most others

02 Complexity of the current implementation is tunable between linear and cubic

03 Demonstrated on a realistic molecular phylogenetics application

Abstract

We introduce the notion of de Bruijn entropy of an Eulerian quiver and show how the corresponding relative entropy can be applied to practical string similarity problems. This approach explicitly links the combinatorial and information-theoretical properties of words and its performance is superior to edit distances in many respects and competitive in most others. The computational complexity of our current implementation is parametrically tunable between linear and cubic, and we outline how an optimized linear algebra subroutine can reduce the cubic complexity to approximately linear. Numerous examples are provided, including a realistic application to molecular phylogenetics.
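The abstract does not spell out the definition of de Bruijn entropy, so the following is only a rough sketch of the combinatorial setting, not the authors' method. It assumes a cyclic word is turned into an order-k de Bruijn multigraph (an Eulerian quiver whose edges are the word's k-mers) and uses the BEST theorem to count Eulerian circuits, with edges treated as distinguishable. Treating the logarithm of that count as an entropy-like quantity is an illustrative assumption, and the function names `de_bruijn_edges`, `count_eulerian_circuits`, and `log_circuit_count` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial, log


def de_bruijn_edges(word, k):
    """Edge multiset (u, v) -> multiplicity from the cyclic k-mers of `word`.

    Vertices are (k-1)-mers; requires 2 <= k <= len(word) + 1.
    """
    doubled = word + word[:k - 1]  # wrap around so every cyclic window is seen
    edges = Counter()
    for i in range(len(word)):
        kmer = doubled[i:i + k]
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges


def det(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    result = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result
        result *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for j in range(c, n):
                m[r][j] -= factor * m[c][j]
    return result


def count_eulerian_circuits(edges):
    """BEST theorem: (#arborescences) * prod over v of (outdeg(v) - 1)!."""
    vertices = sorted({v for e in edges for v in e})
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    out_deg = [0] * n
    lap = [[0] * n for _ in range(n)]  # Laplacian: out-degree minus adjacency
    for (u, v), mult in edges.items():
        out_deg[index[u]] += mult
        lap[index[u]][index[u]] += mult
        lap[index[u]][index[v]] -= mult
    # Any principal minor counts arborescences when the graph is Eulerian.
    minor = [row[1:] for row in lap[1:]]
    count = int(det(minor))
    for d in out_deg:
        count *= factorial(d - 1)
    return count


def log_circuit_count(word, k):
    """Assumed entropy-like quantity: natural log of the circuit count."""
    return log(count_eulerian_circuits(de_bruijn_edges(word, k)))


if __name__ == "__main__":
    print(count_eulerian_circuits(de_bruijn_edges("00010111", 3)))
    print(log_circuit_count("00010111", 3))
```

The determinant here is a dense cubic-time elimination, which is the kind of linear algebra step the abstract suggests could be optimized; the parameter `k` plays the role of a tunable window size in this sketch only.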


Taxonomy

Topics: Algorithms and Data Compression · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies