Probabilistic, statistical and algorithmic aspects of the similarity of texts and application to Gospels comparison
Gane Samb Lo, Soumaila Dembele

TL;DR
This paper explores probabilistic, statistical, and algorithmic methods for text similarity, applying them to compare the four Gospels using k-shinglings and approximation techniques for efficient computation.
Contribution
It introduces a combination of statistical and algorithmic approximation methods for text similarity, specifically applied to biblical texts, enhancing accuracy and efficiency.
Findings
k-shinglings provide an effective basis for measuring the similarity of texts.
The approximation methods significantly reduce computation time.
Applied to the four Gospels, the methods give conclusive similarity results.
Abstract
The fundamental problem of similarity studies, in the framework of data mining, is to examine and detect similar items in very large articles, papers, and books. In this paper, we are interested in the probabilistic, statistical, and algorithmic aspects of the study of texts. We use the approach of k-shinglings, a k-shingling being defined as a sequence of k consecutive characters extracted from a text (k ≥ 1). The main stake in this field is to find accurate and quick algorithms that compute the similarity in a short time. This is achieved by using approximation methods. The first approximation method is statistical and is based on the Glivenko-Cantelli theorem. The second is the banding technique. The third is a modification of the algorithm proposed by Rajaraman et al. (\cite{AnandJeffrey}), denoted here as…
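As a minimal sketch of the k-shingling idea described in the abstract, the Python snippet below extracts the set of k-shingles from a string and compares two such sets. The abstract does not name the similarity measure, so the Jaccard index (size of the intersection divided by size of the union) is an assumption here, chosen because it is the usual companion of shingling. The function names `k_shingles` and `jaccard_similarity` are hypothetical and do not come from the paper, and none of the three approximation methods is implemented.

```python
from typing import Set


def k_shingles(text: str, k: int) -> Set[str]:
    """Return the set of all substrings of k consecutive characters of text."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # A text shorter than k characters yields an empty set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity measure: |a & b| / |a | b| (Jaccard index)."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    first = k_shingles("in the beginning was the word", 5)
    second = k_shingles("in the beginning god created", 5)
    print(jaccard_similarity(first, second))
```

This exact computation builds and intersects the full shingle sets, which becomes costly for book-sized texts. That cost is the motivation the abstract gives for turning to approximation methods.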
Taxonomy
Topics: Data Quality and Management · Data Management and Algorithms · Algorithms and Data Compression
