Probabilistic, statistical and algorithmic aspects of the similarity of texts and application to Gospels comparison

Gane Samb Lo; Soumaila Dembele

arXiv:1508.03772·stat.ME·August 18, 2015·2 cites

Open Access

TL;DR

This paper explores probabilistic, statistical, and algorithmic methods for text similarity, applying them to compare the four Gospels using k-shinglings and approximation techniques for efficient computation.

Contribution

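It combines a statistical approximation method, based on the Glivenko-Cantelli theorem, with algorithmic approximation methods (the banding technique and a modified version of the algorithm of Rajaraman et al.) to compute text similarity accurately and quickly, and applies them to biblical texts.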

Findings

01. k-shinglings give an effective measure of similarity between texts.

02. Approximation methods significantly reduce computation time.

03. The similarity analysis of the four Gospels yields conclusive results.

Abstract

The fundamental problem of similarity studies, in the frame of data mining, is to examine and detect similar items in very large articles, papers, and books. In this paper, we are interested in the probabilistic, statistical, and algorithmic aspects of the study of texts. We use the approach of $k$-shinglings, a $k$-shingling being defined as a sequence of $k$ consecutive characters extracted from a text ($k \geq 1$). The main stake in this field is to find accurate and fast algorithms that compute the similarity in a short time. This is achieved by using approximation methods. The first approximation method is statistical and is based on the Glivenko-Cantelli theorem. The second is the banding technique. The third concerns a modification of the algorithm proposed by Rajaraman et al. (\cite{AnandJeffrey}), denoted here as…
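The definition of a $k$-shingling given in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which similarity measure the paper uses; the Jaccard similarity of the two shingle sets is assumed here because it is the usual choice in the shingling literature. The function names `shingles` and `jaccard` are hypothetical and do not come from the paper, and this exact computation is the baseline that the approximation methods aim to speed up, not one of those methods.

```python
def shingles(text, k):
    """Return the set of all k-shingles (runs of k consecutive characters) of text."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # A text of length n has n - k + 1 windows of length k (none if n < k).
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (assumed measure)."""
    if not a and not b:
        # Convention chosen for this sketch: two empty sets are identical.
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    s1 = shingles("abcd", 2)   # {"ab", "bc", "cd"}
    s2 = shingles("bcde", 2)   # {"bc", "cd", "de"}
    print(jaccard(s1, s2))     # 2 shared shingles out of 4 distinct ones
```

Storing shingles in a set discards repetitions, so the measure depends only on which shingles occur, not on how often. For long texts the shingle sets become large, which is why the exact computation is expensive.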

Taxonomy

Topics: Data Quality and Management · Data Management and Algorithms · Algorithms and Data Compression