Optimal string clustering based on a Laplace-like mixture and EM algorithm on a set of strings
Hitoshi Koyano, Morihiro Hayashida, and Tatsuya Akutsu

TL;DR
This paper introduces a novel unsupervised string clustering method based on a Laplace-like mixture model and an EM algorithm, with proven consistency and asymptotic optimality.
Contribution
It develops a new probabilistic model for string data, constructs consistent estimators, and proposes an EM algorithm for optimal clustering.
Findings
The Laplace-like distribution on strings has two parameters, a location string and a positive dispersion, and its basic properties are established.
The estimators for model parameters are strongly consistent.
The proposed clustering method is asymptotically optimal.
Abstract
In this study, we address the problem of clustering string data in an unsupervised manner by developing a theory of a mixture model and an EM algorithm for string data based on probability theory on a topological monoid of strings developed in our previous studies. We first construct a parametric distribution on a set of strings in the motif of the Laplace distribution on a set of real numbers and reveal its basic properties. This Laplace-like distribution has two parameters: a string that represents the location of the distribution and a positive real number that represents the dispersion. It is difficult to explicitly write maximum likelihood estimators of the parameters because their log likelihood function is a complex function, the variables of which include a string; however, we construct estimators that almost surely converge to the maximum likelihood estimators as the number of…
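To make the two parameters in the abstract concrete, here is a minimal sketch of a Laplace-like weight on strings. It rests on assumptions that are not taken from the paper: the Levenshtein (edit) distance stands in for the distance between strings, and the weight is the unnormalized form exp(-d(s, location) / dispersion), by analogy with the real-valued Laplace density. The paper's actual distribution and its normalizing constant may differ, and the function names `levenshtein` and `laplace_like_weight` are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def laplace_like_weight(s: str, location: str, dispersion: float) -> float:
    """Unnormalized weight exp(-d(s, location) / dispersion).

    Assumed form for illustration only; not the paper's definition.
    """
    if dispersion <= 0:
        raise ValueError("dispersion must be a positive real number")
    return math.exp(-levenshtein(s, location) / dispersion)
```

Under these assumptions the weight is largest at the location string and decays as the edit distance from it grows, and a larger dispersion slows the decay. A proper probability function would also need a normalizing constant, which this sketch leaves out.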
Taxonomy
Topics: Algorithms and Data Compression · Data Management and Algorithms · Bayesian Methods and Mixture Models
