Identifying document similarity using a fast estimation of the Levenshtein Distance based on compression and signatures
Peter Coates, Frank Breitinger

TL;DR
This paper introduces a fast, signature-based method to estimate Levenshtein Distance for large documents, enabling efficient similarity detection with acceptable accuracy and a significance score for thresholding.
Contribution
The paper proposes a novel compression and signature-based approach to approximate Levenshtein Distance, reducing computational complexity for large document similarity analysis.
Findings
Signatures enable efficient comparison with reduced runtime
The method provides a good balance between speed and accuracy
A significance score helps identify related documents effectively
Abstract
Identifying document similarity has many applications, e.g., source code analysis or plagiarism detection. However, identifying similarities is not trivial and can be computationally expensive. For instance, the Levenshtein Distance is a common metric for the similarity between two documents, but its quadratic runtime makes it impractical for large documents, where "large" begins at a few hundred kilobytes. In this paper, we present a novel concept that allows estimating the Levenshtein Distance: the algorithm first compresses documents to signatures (similar to hash values) using a user-defined compression ratio. Signatures can then be compared against each other (some constraints apply), and the outcome is the estimated Levenshtein Distance. Our evaluation shows promising results in terms of runtime efficiency and accuracy. In addition, we introduce a significance score allowing…
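
To make the idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes one plausible way to build a signature: hash a sliding window of characters and emit one signature character whenever the hash is divisible by the compression ratio, so the signature is roughly 1/ratio the length of the document. The names `signature`, `estimate_distance`, `ALPHABET`, and the parameters `ratio` and `window` are hypothetical, as is the choice of CRC32 as the hash. The estimate is obtained by running the exact (quadratic) Levenshtein Distance on the short signatures and scaling by the ratio.

```python
import zlib

# Hypothetical output alphabet for signature characters (an assumption).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"


def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein Distance via dynamic programming, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


def signature(text: str, ratio: int = 16, window: int = 8) -> str:
    """Compress text to a signature about 1/ratio of its length (assumed scheme)."""
    out = []
    for i in range(len(text) - window + 1):
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        h = zlib.crc32(text[i:i + window].encode("utf-8"))
        if h % ratio == 0:
            out.append(ALPHABET[(h // ratio) % len(ALPHABET)])
    return "".join(out)


def estimate_distance(sig_a: str, sig_b: str, ratio: int = 16) -> int:
    """Scale the distance between two signatures back to document scale."""
    return levenshtein(sig_a, sig_b) * ratio
```

Because the quadratic step runs on signatures rather than full documents, its cost under these assumptions drops by roughly a factor of ratio squared. Both signatures must be built with the same `ratio` and `window` to be comparable, which is one example of the kind of constraint the abstract alludes to.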
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Spam and Phishing Detection
