Proposal and study of statistical features for string similarity computation and classification
E.O. Rodrigues, D. Casanova, M. Teixeira, V. Pegorini, F. Favarim, E. Clua, A. Conci, Panos Liatsis

TL;DR
This paper adapts two statistical features from visual computing, the co-occurrence matrix (COM) and the run-length matrix (RLM), to string similarity and classification, and reports that they outperform traditional measures in its experiments.
Contribution
It proposes language-independent statistical features (COM and RLM) for string similarity and compares them against existing measures such as longest common subsequence, mutual information and edit distances.
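To make the idea concrete, here is a minimal sketch of how co-occurrence and run-length counts could be taken over a string instead of an image. It follows the generic image-processing definitions applied to a one-dimensional sequence; the function names, the `offset` parameter, and the cosine comparison at the end are illustrative assumptions, not the paper's exact feature definitions.

```python
from collections import Counter
from itertools import groupby
from math import sqrt


def cooccurrence_counts(s, offset=1):
    """Count ordered symbol pairs (s[i], s[i + offset]); offset >= 1 is assumed."""
    return Counter(zip(s, s[offset:]))


def run_length_counts(s):
    """Count (symbol, run_length) pairs over maximal runs of equal symbols."""
    return Counter((symbol, len(list(run))) for symbol, run in groupby(s))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (an assumed comparison)."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical usage: compare two strings through their co-occurrence counts.
similarity = cosine_similarity(
    cooccurrence_counts("statistical"),
    cooccurrence_counts("statistics"),
)
```

Neither count depends on a dictionary, tokenizer, or grammar, which is what makes features of this kind language independent.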
Findings
COM and RLM features outperform other statistical features in synthetic experiments.
RLM features achieve the best results on a real text plagiarism dataset.
In 3 out of 4 cases, the advantage of RLM and COM features over the second-best group (distance-based measures) is statistically significant.
Abstract
Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information. These are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically more significant than the second-best group based on distances (P-value <…
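For reference, two of the baseline measures named in the abstract can be sketched with standard dynamic programming. The abstract says only "edit distances", so the Levenshtein variant below is an assumed choice, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Both run in time proportional to the product of the two string lengths and keep only two rows of the table in memory.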
