Nonparametric Inference on Unlabeled Histograms
Yun Ma, Pengkun Yang

TL;DR
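This paper introduces a nonparametric framework for inference on unlabeled histograms. It models the multiset of frequency counts with a mixture distribution to account for unseen domain elements, shows that the NPMLE attains the optimal convergence rate under the Poisson setting, and demonstrates practical benefits in experiments.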
Contribution
It proposes a novel mixture distribution model for unlabeled histograms and establishes the optimal convergence rate of the NPMLE under this framework.
Findings
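The NPMLE achieves the optimal convergence rate under the Poisson setting.
Plug-in estimators built from the NPMLE are flexible and efficient; a localized variant achieves the optimal sample complexity for a wide range of symmetric functionals.
Experiments on synthetic data, real-world datasets, and large language models show practical benefits.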
Abstract
Statistical inference on histograms and frequency counts plays a central role in categorical data analysis. Moving beyond classical methods that directly analyze labeled frequencies, we introduce a framework that models the multiset of unlabeled histograms via a mixture distribution to better capture unseen domain elements in the large-alphabet regime. We study the nonparametric maximum likelihood estimator (NPMLE) under this framework, and establish its optimal convergence rate under the Poisson setting. The NPMLE also immediately yields flexible and efficient plug-in estimators for functional estimation problems, where a localized variant further achieves the optimal sample complexity for a wide range of symmetric functionals. Extensive experiments on synthetic data, real-world datasets, and large language models highlight the practical benefits of the proposed method.
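As a rough illustration of the idea in the abstract, the sketch below fits mixing weights over a fixed grid of Poisson rates by EM, using only the multiset of counts, and then forms a plug-in estimate of a symmetric functional (Shannon entropy is used as the example). This is not the authors' implementation. The grid discretization, the EM solver, the assumption that the domain size is known (so that zero counts can be included), the toy data, and every function name here are assumptions made for illustration; the localized variant mentioned in the abstract is not reproduced.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability P(Y = y) for a rate lam > 0."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def log_likelihood(counts, grid, weights):
    """Log-likelihood of the counts under the grid Poisson mixture."""
    return sum(
        math.log(sum(w * poisson_pmf(y, lam) for w, lam in zip(weights, grid)))
        for y in counts
    )


def fit_grid_npmle(counts, grid, n_iter=500):
    """EM for mixing weights on a fixed grid of Poisson rates.

    counts: one count per domain element, zeros included. Knowing the
            domain size is an assumption of this sketch.
    grid:   candidate positive Poisson rates (hypothetical discretization).
    """
    # Collapse to the unlabeled histogram: count value -> multiplicity.
    hist = {}
    for y in counts:
        hist[y] = hist.get(y, 0) + 1
    # The likelihood table only depends on the distinct count values.
    lik = {y: [poisson_pmf(y, lam) for lam in grid] for y in hist}
    weights = [1.0 / len(grid)] * len(grid)
    for _ in range(n_iter):
        new = [0.0] * len(grid)
        for y, mult in hist.items():
            mix = sum(w * l for w, l in zip(weights, lik[y]))
            for j, (w, l) in enumerate(zip(weights, lik[y])):
                # Posterior responsibility of grid point j, weighted by
                # how many domain elements share this count value.
                new[j] += mult * w * l / mix
        total = sum(new)
        weights = [v / total for v in new]
    return weights


def plugin_functional(weights, grid, g, domain_size, sample_size):
    """Plug-in estimate of sum_i g(p_i), treating p_i as rate / sample_size."""
    return domain_size * sum(
        w * g(lam / sample_size) for w, lam in zip(weights, grid)
    )


# Toy data (made up): 100 domain elements, half of them never observed.
counts = [0] * 50 + [1] * 30 + [2] * 15 + [5] * 5
grid = [0.1 * i for i in range(1, 81)]  # rates 0.1, 0.2, ..., 8.0
weights = fit_grid_npmle(counts, grid)

entropy_estimate = plugin_functional(
    weights,
    grid,
    g=lambda p: -p * math.log(p),
    domain_size=len(counts),
    sample_size=sum(counts),
)
print(f"plug-in entropy estimate (nats): {entropy_estimate:.4f}")
```

Because the fit depends only on how many elements share each count value, relabeling the domain elements leaves the estimate unchanged, which is the sense in which the histogram is unlabeled. Any other symmetric functional can be substituted by passing a different `g`.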
Taxonomy
Topics: Bayesian Methods and Mixture Models · Machine Learning and Data Classification · Speech Recognition and Synthesis
