Effects of Additional Data on Bayesian Clustering
Keisuke Yamazaki

TL;DR
This paper provides a theoretical analysis of how additional data impacts the accuracy of Bayesian clustering models, highlighting the benefits and drawbacks of increased data and model complexity.
Contribution
It offers a novel theoretical framework to understand the effects of extra data on the accuracy of hierarchical probabilistic clustering models.
Findings
Additional data can improve latent variable estimation accuracy.
Increased model complexity may reduce accuracy because the combined model has a higher-dimensional parameter.
Theoretical insights clarify when extra data is beneficial or detrimental.
Abstract
Hierarchical probabilistic models, such as mixture models, are used for cluster analysis. These models have two types of variables: observable and latent. In cluster analysis, the latent variable is estimated, and additional information is expected to improve the accuracy of that estimate. Many proposed learning methods, including semi-supervised learning and transfer learning, can use additional data. However, from a statistical point of view, a complex probabilistic model that encompasses both the initial and additional data might be less accurate because its parameter is higher-dimensional. The present paper gives a theoretical analysis of the accuracy of such a model and clarifies which factor has the greater effect on accuracy: the advantage of obtaining additional data or the disadvantage of increased complexity.
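To make the observable/latent distinction concrete, the following is a minimal sketch of estimating the latent label in a mixture model. It is not the paper's method. It assumes a one-dimensional Gaussian mixture with known, fixed parameters, whereas a Bayesian treatment would average over the parameter's posterior. The names `normal_pdf` and `label_posterior`, and the choice of Gaussian components, are assumptions made for illustration only.

```python
import math


def normal_pdf(x, mean, std=1.0):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def label_posterior(x, weights, means, std=1.0):
    """Return p(y = k | x) for each component k of a 1-D Gaussian mixture.

    x is the observable variable and y (the component index) is the latent
    variable. Parameters are treated as known here (an illustrative
    assumption); the Bayesian approach would integrate them out instead.
    """
    # Joint density p(x, y = k) = weight_k * N(x | mean_k, std^2)
    joint = [a * normal_pdf(x, m, std) for a, m in zip(weights, means)]
    total = sum(joint)  # marginal density p(x)
    return [j / total for j in joint]


if __name__ == "__main__":
    # A point at x = 2.0 is closer to the component centred at 1.0,
    # so that component receives the larger posterior probability.
    print(label_posterior(2.0, [0.5, 0.5], [-1.0, 1.0]))
```

The posterior over the label is the quantity whose accuracy is at stake in cluster analysis. Additional data would enter through how well the parameters (here `weights` and `means`) are determined, which this fixed-parameter sketch deliberately leaves out.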
Taxonomy
Topics: Bayesian Methods and Mixture Models · Bayesian Modeling and Causal Inference · Statistical Methods and Bayesian Inference
