Nonparametric Inference on Unlabeled Histograms

Yun Ma; Pengkun Yang

arXiv:2511.05077·math.ST·November 10, 2025

Open Access

TL;DR

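This paper introduces a nonparametric framework for inference on unlabeled histograms. It models the multiset of frequency counts with a mixture distribution to account for unseen domain elements, and shows that the resulting NPMLE and its plug-in estimators attain optimal rates. Experiments on synthetic data, real-world datasets, and large language models illustrate the practical benefits.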

Contribution

The paper proposes a mixture distribution model for the multiset of unlabeled histograms and establishes the optimal convergence rate of the NPMLE under the Poisson setting.

Findings

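01. The NPMLE achieves the optimal convergence rate under the Poisson setting.

02. Plug-in estimators built on the NPMLE are flexible and efficient for functional estimation.

03. Experiments on synthetic data, real-world datasets, and large language models show practical advantages.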

Abstract

Statistical inference on histograms and frequency counts plays a central role in categorical data analysis. Moving beyond classical methods that directly analyze labeled frequencies, we introduce a framework that models the multiset of unlabeled histograms via a mixture distribution to better capture unseen domain elements in the large-alphabet regime. We study the nonparametric maximum likelihood estimator (NPMLE) under this framework and establish its optimal convergence rate under the Poisson setting. The NPMLE also immediately yields flexible and efficient plug-in estimators for functional estimation problems, where a localized variant further achieves the optimal sample complexity for a wide range of symmetric functionals. Extensive experiments on synthetic data, real-world datasets, and large language models highlight the practical benefits of the proposed method.
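
As a rough illustration of the idea in the abstract, the sketch below fits a Poisson mixture to a multiset of counts by maximizing the likelihood over mixing weights on a fixed grid, then plugs the fitted mixing distribution into a symmetric functional. This is not the authors' algorithm: the grid discretization, the EM update, the assumption that the alphabet size k is known (so that zero counts are observed), and the names poisson_pmf_matrix, grid_npmle, and plugin_functional are all assumptions made for illustration.

```python
import math
import numpy as np

def poisson_pmf_matrix(counts, rates):
    """Matrix L[i, j] = P(Poisson(rates[j]) = counts[i]); rates must be > 0."""
    c = np.asarray(counts, dtype=float)
    r = np.asarray(rates, dtype=float)
    log_fact = np.array([math.lgamma(x + 1.0) for x in c])
    log_pmf = c[:, None] * np.log(r)[None, :] - r[None, :] - log_fact[:, None]
    return np.exp(log_pmf)

def grid_npmle(counts, rates, n_iter=500):
    """Approximate NPMLE of the mixing distribution on a fixed grid via EM."""
    L = poisson_pmf_matrix(counts, rates)
    w = np.full(len(rates), 1.0 / len(rates))   # start from uniform weights
    for _ in range(n_iter):
        mix = L @ w                              # mixture likelihood of each count
        w = w * (L.T @ (1.0 / mix)) / L.shape[0] # EM update; keeps sum(w) = 1
    return w

def plugin_functional(weights, rates, f, k, n):
    """Plug-in estimate of sum_i f(p_i), with p = rate / n and k symbols (assumed known)."""
    return k * float(np.sum(weights * f(np.asarray(rates) / n)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, n = 500, 1000                             # hypothetical alphabet size and sample size
    p = rng.dirichlet(np.ones(k))                # synthetic distribution, for illustration only
    counts = rng.poisson(n * p)                  # Poissonized counts; labels are discarded
    rates = np.linspace(0.05, counts.max() + 5.0, 100)
    w = grid_npmle(counts, rates)
    entropy_term = lambda q: -q * np.log(q)
    print("plug-in entropy:", plugin_functional(w, rates, entropy_term, k, n))
    print("true entropy:   ", float(np.sum(entropy_term(p))))
```

Only the multiset of counts enters the fit, so permuting the symbol labels leaves the estimate unchanged. A finer grid or more EM iterations would bring the weights closer to the exact NPMLE at higher cost.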

Taxonomy

Topics: Bayesian Methods and Mixture Models · Machine Learning and Data Classification · Speech Recognition and Synthesis