
TL;DR
This paper introduces the harmonic indel distance (HID), a string distance in which the cost of an insertion or deletion is inversely proportional to the string length, and compares its performance and geometric properties with existing indel distances on biomedical sequence data.
Contribution
The paper proposes the harmonic indel distance, gives a closed-form formula for it, shows that it is a proper distance metric, and examines its advantages through experimental comparisons and planar embeddings of the benchmark datasets.
Findings
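HID is a proper distance metric.
HID outperforms normalized and unnormalized indel distances on benchmark tasks for biomedical sequence data.
Planar embeddings of the benchmark datasets reveal geometric differences between the metric spaces associated with the different distances.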
Abstract
This short note introduces the harmonic indel distance (HID), a new distance between strings where the cost of an insertion or deletion is inversely proportional to the string length. We present a closed-form formula and show that the HID is a proper distance metric. Then we perform an experimental comparison of HID to normalized and unnormalized versions of the indel distance on benchmark tasks for biomedical sequence data. We finally show planar embeddings of the benchmark datasets to provide some insights into the geometry of the metric spaces associated with the different distance metrics.
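The two baselines named in the abstract can be sketched in a few lines of Python. The unnormalized indel distance is the standard one: the number of single-character insertions and deletions needed to turn one string into the other, equal to |x| + |y| - 2·LCS(x, y). The normalization shown here, division by |x| + |y|, is one common convention and is an assumption; the paper may use a different one. The closed-form formula for HID is not given in this summary, so the sketch does not try to reproduce it. All function names are illustrative.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(x: str, y: str) -> int:
    """Unnormalized indel distance: every insertion or deletion costs 1."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def normalized_indel_distance(x: str, y: str) -> float:
    """Indel distance divided by |x| + |y| (assumed normalization, in [0, 1])."""
    total = len(x) + len(y)
    return indel_distance(x, y) / total if total else 0.0


if __name__ == "__main__":
    print(indel_distance("ACGT", "AGT"))             # one deletion
    print(normalized_indel_distance("ACGT", "AGT"))  # 1 / 7
```

Under the unit-cost baseline, deleting one character costs the same in a 4-character string as in a 4000-character one; HID, as described in the abstract, instead scales that cost inversely with string length.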
Taxonomy
Topics: Algorithms and Data Compression · Music and Audio Processing · Natural Language Processing Techniques
