The exact probability law for the approximated similarity from the Minhashing method
Soumaila Dembele, Gane Samb Lo

TL;DR
This paper establishes the exact probability law for the similarity estimates produced by Minhashing algorithms, specifically the RU and RUM methods, providing a theoretical foundation for their use in large text similarity tasks.
Contribution
It introduces a probabilistic framework that characterizes the distribution of the similarity estimates returned by the RU and RUM Minhash algorithms, and shows that, in the ideal case of carefully chosen probability laws, the expected value of the estimate is the true similarity.
Findings
In the ideal case of carefully chosen probability laws, the exact similarity equals the mathematical expectation of the random similarity returned by the algorithm.
Provides a theoretical basis for the validity of Minhash-based similarity estimation.
Extends the analysis to modified versions of the algorithm.
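The first finding can be illustrated numerically. The sketch below is not the authors' RU or RUM procedure; it is the textbook Minhash estimate under the assumption of uniformly random permutations, with texts represented as plain sets. The function names `jaccard` and `minhash_similarity` and the parameter `num_perm` are hypothetical and chosen for illustration only.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_similarity(a, b, num_perm, rng):
    """Random similarity: the fraction of random permutations under which
    the two non-empty sets share the same minimum rank.

    Assumption (not from the paper): each permutation is drawn uniformly
    at random. Only the relative order of elements in A | B matters, so
    it is enough to permute the union.
    """
    a, b = set(a), set(b)
    universe = sorted(a | b)
    matches = 0
    for _ in range(num_perm):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)  # one uniform random permutation
        rank = dict(zip(universe, ranks))
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            matches += 1
    return matches / num_perm


if __name__ == "__main__":
    rng = random.Random(0)
    a = set(range(0, 60))   # stand-in for the shingles of a first text
    b = set(range(30, 90))  # stand-in for the shingles of a second text
    estimates = [minhash_similarity(a, b, 100, rng) for _ in range(100)]
    print("exact similarity :", jaccard(a, b))
    print("mean of estimates:", sum(estimates) / len(estimates))
```

Each call to `minhash_similarity` returns a random value. Under the uniform-permutation assumption its expectation is the exact Jaccard similarity, so the average over many independent runs should settle close to the exact value.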
Abstract
We propose a probabilistic setting in which we study the probability law of the Rajaraman and Ullman RU algorithm and a modified version of it denoted by RUM. These algorithms aim at estimating the similarity index between huge texts in the context of the web. We give a foundation for this method by showing that, in the ideal case of carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity provided by the algorithm. Some extensions are given.
