A Method for Finding Similar Documents Relying on Adding Repetition of Symbols in Length Based Filtering
Hossein Azgomi, Masumeh Ghasemi Mahsayeh, Masoud Mohammadi, Milad Moradi

TL;DR
This paper introduces a new document similarity method that considers symbol repetition, aiming to improve accuracy and efficiency over existing length-based filtering and shingling techniques.
Contribution
It proposes a novel approach that incorporates symbol repetition into length-based filtering to enhance document similarity detection and reduce computational comparisons.
Findings
Improved accuracy in document similarity measurement.
Reduced number of comparisons needed.
Faster processing time for large datasets.
Abstract
A basic topic in mining massive datasets is finding similar items; finding similar documents is one example. Many methods exist for this task, among them shingling and length-based filtering. In shingling, substrings called symbols are extracted from each document and placed in a set, and the similarity of two documents is computed as the similarity of their sets. In length-based filtering, only documents whose lengths are close to each other are compared. These methods do not consider the repetition of symbols. Taking repetition into account allows the length of a document to be calculated more accurately. In this paper we propose a method for finding similar documents that considers the repetition of symbols. This method separates documents more effectively. The main goal of this paper is to present a…
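The two baseline techniques named in the abstract, and the effect of counting repeated symbols, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the shingle size `k`, the threshold, and the use of Jaccard similarity over multisets to model "repetition" are all assumptions made for illustration.

```python
from collections import Counter


def shingles(text: str, k: int = 2) -> list:
    """Return every length-k substring (symbol) of text, repeats included."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def jaccard_set(a: list, b: list) -> float:
    """Jaccard similarity of the symbol sets (repetition ignored)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def jaccard_bag(a: list, b: list) -> float:
    """Jaccard similarity of the symbol multisets (repetition counted).

    Assumption: repetition is modeled with multiset intersection and union.
    """
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())  # per-symbol minimum of counts
    union = sum((ca | cb).values())  # per-symbol maximum of counts
    return inter / union if union else 1.0


def passes_length_filter(len_a: int, len_b: int, threshold: float) -> bool:
    """Length-based filter: a pair can reach `threshold` similarity only if
    the ratio of the shorter length to the longer one is at least `threshold`.
    """
    longest = max(len_a, len_b)
    if longest == 0:
        return True
    return min(len_a, len_b) / longest >= threshold


doc_a, doc_b = "abababab", "abab"  # hypothetical toy documents
sa, sb = shingles(doc_a), shingles(doc_b)

# Without repetition both documents have length 2, so the pair is compared.
print(passes_length_filter(len(set(sa)), len(set(sb)), 0.8), jaccard_set(sa, sb))
# With repetition the lengths are 7 and 3, so the pair is filtered out.
print(passes_length_filter(len(sa), len(sb), 0.8), jaccard_bag(sa, sb))
```

In this toy case the set-based length cannot tell the two documents apart, while the repetition-aware length lets the filter skip the comparison, which is the kind of saving the abstract describes.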
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Web Data Mining and Analysis
