Estimating the joint distribution of independent categorical variables via model selection
C. Durot, E. Lebarbier, A.-S. Tocquet

TL;DR
This paper introduces a new wavelet-based penalized least-squares estimator for nonparametrically estimating the joint distribution of independent categorical variables, with proven adaptivity and practical implementation for large datasets.
Contribution
It proposes a nonparametric, non-asymptotic estimator of the expectation of the multinomial sequence, built from wavelets and a penalized least-squares criterion, that remains implementable even for large sequences.
Findings
Estimator satisfies an oracle inequality
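Proven to be adaptive in the minimax sense over a class of Besov bodies
The general framework also recovers an existing segmentation method; a simulation study and a real-data application are reported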
Abstract
Assume one observes independent categorical variables or, equivalently, one observes the corresponding multinomial variables. Estimating the distribution of the observed sequence amounts to estimating the expectation of the multinomial sequence. A new estimator for this mean is proposed that is nonparametric, non-asymptotic and implementable even for large sequences. It is a penalized least-squares estimator based on wavelets, with a penalization term inspired by papers of Birgé and Massart. The estimator is proved to satisfy an oracle inequality and to be adaptive in the minimax sense over a class of Besov bodies. The method is embedded in a general framework which allows us to recover also an existing method for segmentation. Beyond theoretical results, a simulation study is reported and an application on real data is provided.
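To make the idea of a wavelet-based penalized least-squares estimator concrete, here is a minimal Python sketch. It is not the authors' estimator. It assumes a single category (a 0/1 indicator sequence whose mean is the probability to estimate), the Haar basis, a sequence length that is a power of two, and a penalty of the shape D(c1 log(n/D) + c2), a form that appears in the Birgé and Massart style of model selection. The function names, the constants c1 and c2, and the variance bound sigma2 = 0.25 are all illustrative assumptions.

```python
import numpy as np


def haar_forward(x):
    """Orthonormal Haar transform of a vector whose length is a power of two."""
    out = np.asarray(x, dtype=float).copy()
    n = out.size
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    length = n
    while length > 1:
        half = length // 2
        approx = (out[0:length:2] + out[1:length:2]) / np.sqrt(2.0)
        detail = (out[0:length:2] - out[1:length:2]) / np.sqrt(2.0)
        out[:half] = approx
        out[half:length] = detail
        length = half
    return out


def haar_inverse(coef):
    """Inverse of haar_forward."""
    out = np.asarray(coef, dtype=float).copy()
    n = out.size
    length = 2
    while length <= n:
        half = length // 2
        approx = out[:half].copy()
        detail = out[half:length].copy()
        out[0:length:2] = (approx + detail) / np.sqrt(2.0)
        out[1:length:2] = (approx - detail) / np.sqrt(2.0)
        length *= 2
    return out


def penalized_haar_estimate(y, c1=2.0, c2=5.0, sigma2=0.25):
    """Keep the scaling coefficient plus the largest detail coefficients.

    The model dimension D minimises
        RSS(D) + sigma2 * D * (c1 * log(n / D) + c2),
    where c1, c2 and sigma2 are illustrative (hypothetical) choices.
    Returns the estimated mean sequence and the selected dimension D.
    """
    coef = haar_forward(y)
    n = coef.size
    detail = coef[1:]
    order = np.argsort(-np.abs(detail))          # largest details first
    sorted_sq = detail[order] ** 2
    # rss[k] = residual sum of squares when the k largest details are kept
    rss = sorted_sq.sum() - np.concatenate(([0.0], np.cumsum(sorted_sq)))
    dim = np.arange(1, n + 1)                    # D = k + 1 (scaling coefficient)
    pen = sigma2 * dim * (c1 * np.log(n / dim) + c2)
    k_best = int(np.argmin(rss + pen))
    kept = np.zeros(n)
    kept[0] = coef[0]
    kept[1 + order[:k_best]] = detail[order[:k_best]]
    return haar_inverse(kept), k_best + 1


# Usage: a Bernoulli sequence whose success probability jumps halfway through.
rng = np.random.default_rng(0)
p_true = np.where(np.arange(256) < 128, 0.1, 0.9)
y = rng.binomial(1, p_true).astype(float)
p_hat, D = penalized_haar_estimate(y)
print("selected dimension:", D)
```

Because the Haar basis is orthonormal, the least-squares fit on any set of coefficients is obtained simply by keeping those coefficients, so the search over models reduces to sorting the detail coefficients once. A full treatment of several categories would apply the same idea to each coordinate of the multinomial vectors; that extension is not shown here.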
