Faster Algorithm of String Comparison
Qi Xiao Yang, Sung Sam Yuan, Lu Chun, Li Zhao, Sun Peng

TL;DR
This paper introduces substring-based algorithms for string similarity that outperform existing token-based methods in both accuracy and time complexity.
Contribution
The paper presents novel substring-based algorithms that improve accuracy and reduce time complexity for Field Similarity compared to prior token-based approaches.
Findings
Achieves time complexity of O(knm) with k < 0.75 in the worst case
Demonstrates higher accuracy through theoretical analysis and experiments
Significantly improves computation speed for string similarity tasks
Abstract
In many applications, it is necessary to determine string similarity. The edit distance approach [WF74] is a classic method for determining Field Similarity. A well-known dynamic programming algorithm [GUS97] calculates edit distance with time complexity O(nm) (for the worst case, average case and even best case). Instead of continuing to improve the edit distance approach, [LL+99] adopted a brand-new token-based approach. Its new token-based concept (which retains the original semantic information), its good time complexity of O(nm) (for the worst, average and best case) and its good experimental performance make it a milestone paper in this area. Further study indicates that there is still room for improvement in its Field Similarity algorithm. Our paper introduces a package of new substring-based algorithms to determine Field Similarity. Combined together, our new algorithms not only…
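For orientation, the edit distance baseline mentioned in the abstract can be sketched as follows. This is a minimal Python version of the textbook dynamic-programming recurrence with unit costs, not the paper's substring-based algorithm and not the token-based method of [LL+99]. The function name `edit_distance` and the two-row layout are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance with unit costs.

    Illustrative sketch of the classic baseline only; the name and the
    two-row memory layout are assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)
    # prev[j] = distance between the first i-1 characters of a and b[:j]
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        # curr[0] = i: deleting i characters turns a[:i] into ""
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete a[i-1]
                curr[j - 1] + 1,     # insert b[j-1]
                prev[j - 1] + cost,  # match or substitute
            )
        prev = curr
    return prev[m]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The nested loops always fill all n·m cells regardless of the input, which is why the abstract quotes O(nm) for the best case as well as the worst and average cases.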
Taxonomy
Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · Web Data Mining and Analysis
