Computable Bounds and Monte Carlo Estimates of the Expected Edit Distance

Gianfranco Bilardi; Michele Schimd

arXiv:2211.07644 · cs.FL · April 9, 2024

Open Access

TL;DR

This paper studies the expected edit distance between independent random strings of length n over an alphabet of size k. It proves computable bounds on the limit of the normalized expected distance, gives an exact but exponential-time algorithm, and analyzes statistical estimates that remain practical for large string lengths.

Contribution

It introduces new bounds, algorithms, and statistical estimation methods for the expected edit distance, improving accuracy and efficiency over previous approaches.

Findings

01

Upper and lower bounds on the limit of the normalized expected edit distance are established.

02

A computationally intensive algorithm for exact values is presented.

03

Statistical estimates with high confidence are feasible for large string lengths.
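
The third finding can be illustrated with a minimal Monte Carlo sketch. It is a generic estimator, not the authors' procedure (the abstract is cut off before their method is described): it samples independent uniform string pairs, averages the normalized edit distance, and attaches a distribution-free Hoeffding confidence half-width, which applies because each normalized distance lies in [0, 1]. The names `edit_distance` and `estimate_alpha` and the parameters `samples`, `delta`, and `seed` are assumptions made for illustration.

```python
import math
import random


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete x
                curr[j - 1] + 1,           # insert y
                prev[j - 1] + (x != y),    # match or substitute
            ))
        prev = curr
    return prev[-1]


def estimate_alpha(k, n, samples, delta=0.05, seed=0):
    """Estimate alpha_k(n) = e_k(n) / n from `samples` random pairs.

    Returns (mean, half_width). By Hoeffding's inequality, the true
    alpha_k(n) lies within mean +/- half_width with probability
    at least 1 - delta.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        a = [rng.randrange(k) for _ in range(n)]
        b = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(a, b)
    mean = total / (samples * n)
    half_width = math.sqrt(math.log(2 / delta) / (2 * samples))
    return mean, half_width
```

A call such as `estimate_alpha(4, 200, 1000)` costs about `samples * n**2` elementary steps. The half-width depends only on `samples` and `delta`, so halving it requires four times as many samples.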

Abstract

The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let $e_k(n)$ denote the average edit distance between random, independent strings of $n$ characters from an alphabet of size $k$. For $k \geq 2$, it is an open problem how to efficiently compute the exact value of $\alpha_k(n) = e_k(n)/n$ as well as of $\alpha_k = \lim_{n \to \infty} \alpha_k(n)$, a limit known to exist. This paper shows that $\alpha_k(n) - Q(n) \leq \alpha_k \leq \alpha_k(n)$, for a specific $Q(n) = \Theta(\log n / n)$, a result which implies that $\alpha_k$ is computable. The exact computation of $\alpha_k(n)$ is explored, leading to an algorithm running in time $T = O(n^2 k \min(3^n, k^n))$, a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on…
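
To make the definitions concrete, the sketch below computes $\alpha_k(n)$ exactly for very small $n$ by enumerating all $k^{2n}$ string pairs. This is plain brute force, not the algorithm from the paper, and it is far slower than the stated $O(n^2 k \min(3^n, k^n))$ bound. The function names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete x
                curr[j - 1] + 1,           # insert y
                prev[j - 1] + (x != y),    # match or substitute
            ))
        prev = curr
    return prev[-1]


def exact_alpha(k, n):
    """Exact alpha_k(n) = e_k(n) / n as a Fraction, by full enumeration.

    Enumerates every ordered pair of length-n strings over an alphabet
    of size k, so it is only usable for tiny k and n.
    """
    strings = list(product(range(k), repeat=n))
    total = sum(edit_distance(a, b) for a in strings for b in strings)
    return Fraction(total, n * k ** (2 * n))
```

For example, `exact_alpha(2, 2)` returns `Fraction(1, 2)`, and `exact_alpha(k, 1)` equals $(k-1)/k$, the probability that two random characters differ. By the inequality in the abstract, every such value is an upper bound on $\alpha_k$.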

Taxonomy

Topics: Algorithms and Data Compression · Machine Learning and Algorithms · Semigroups and Automata Theory