Estimating the joint distribution of independent categorical variables via model selection

C. Durot; E. Lebarbier; A.-S. Tocquet

arXiv:0906.2275·math.ST·June 15, 2009

TL;DR

This paper introduces a wavelet-based penalized least-squares estimator of the joint distribution of independent categorical variables. The estimator is nonparametric and non-asymptotic, satisfies an oracle inequality, is adaptive in the minimax sense over a class of Besov bodies, and remains implementable even for large sequences.

Contribution

It proposes a nonparametric, non-asymptotic penalized least-squares estimator based on wavelets, with a penalty inspired by the work of Birgé and Massart, that remains implementable even for large sequences.

Findings

01

Estimator satisfies an oracle inequality

02

Proven to be adaptive in the minimax sense over a class of Besov bodies

03

General framework also recovers an existing segmentation method; a simulation study and a real-data application are reported

Abstract

Assume one observes independent categorical variables or, equivalently, one observes the corresponding multinomial variables. Estimating the distribution of the observed sequence amounts to estimating the expectation of the multinomial sequence. A new estimator for this mean is proposed that is nonparametric, non-asymptotic and implementable even for large sequences. It is a penalized least-squares estimator based on wavelets, with a penalization term inspired by papers of Birgé and Massart. The estimator is proved to satisfy an oracle inequality and to be adaptive in the minimax sense over a class of Besov bodies. The method is embedded in a general framework which allows us to recover also an existing method for segmentation. Beyond theoretical results, a simulation study is reported and an application on real data is provided.
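To make the idea of a wavelet-based penalized least-squares estimator concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' implementation. The abstract does not give the model collection, the penalty shape, or the constants, so the Haar basis, the penalty `c * D * (1 + log(n / D))` (a common Birgé-Massart-style form), the default `c = 2.0`, and all function names below are illustrative assumptions. The sketch also assumes the sequence length is a power of 2.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar basis of R^n (n a power of 2), one basis vector per row."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                   # coarser-scale vectors, stretched
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])  # finest-scale differences
    return np.vstack([top, bottom]) / np.sqrt(2.0)

def one_hot(x, r):
    """Multinomial (indicator) coding: column i is the indicator vector of x[i]."""
    y = np.zeros((r, len(x)))
    y[x, np.arange(len(x))] = 1.0
    return y

def penalized_haar_estimate(x, r, c=2.0):
    """Estimate the r x n matrix of category probabilities E[Y].

    Hypothetical sketch: keeps the constant Haar vector plus the D - 1
    wavelet vectors with the largest coefficient energy, and picks D by
    minimizing (least-squares criterion + assumed penalty).
    """
    n = len(x)
    y = one_hot(x, r)
    H = haar_matrix(n)
    coef = y @ H.T                          # column k: coefficients on basis vector k
    energy = (coef ** 2).sum(axis=0)        # energy per basis vector, over categories
    order = 1 + np.argsort(energy[1:])[::-1]  # wavelet vectors, largest energy first
    # Energy captured by the best model of each dimension D = 1, ..., n
    # (the constant vector's energy is common to all models and is omitted).
    gains = np.concatenate([[0.0], np.cumsum(energy[order])])
    dims = np.arange(1, n + 1)
    pen = c * dims * (1.0 + np.log(n / dims))  # assumed penalty shape and constant
    best_dim = int(np.argmin(-gains + pen)) + 1
    keep = np.zeros(n, dtype=bool)
    keep[0] = True
    keep[order[: best_dim - 1]] = True
    estimate = (coef * keep) @ H            # projection onto the selected vectors
    return estimate, best_dim

# Toy sequence with a single change point in the middle.
x = np.array([0] * 32 + [1] * 32)
estimate, best_dim = penalized_haar_estimate(x, r=2)
```

Because the Haar basis is orthonormal, minimizing the least-squares criterion over all models of a given dimension reduces to keeping the largest coefficients, which is why the sketch only needs a sort and a cumulative sum. The projection is not guaranteed to stay within [0, 1], so a practical version may need an additional step to turn the output into valid probabilities.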
