Efficient subsampling for exponential family models

Subhadra Dasgupta; Holger Dette

arXiv:2306.16821·stat.ME·March 20, 2024·Comput. Stat. Data Anal.

Open Access

TL;DR

This paper introduces a two-stage subsampling method for exponential family models that leverages optimal design principles and matrix distances to efficiently select informative samples, improving estimation accuracy.

Contribution

The paper presents a novel two-stage subsampling algorithm based on optimal design and matrix distances, applicable to a wide range of regression models with complex Fisher information structures.

Findings

01. Effective identification of the design space via clustering

02. Optimal approximate design improves sampling efficiency

03. Method applicable to models with high-rank Fisher information
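
To make the rank remark in the last finding concrete, the information matrix of an approximate design can be written as below. The notation ($\xi$, $w_i$, $x_i$, $\mathcal{I}$, $f$) is assumed here for illustration and is not taken from the paper.

```latex
% Approximate design \xi with support points x_1, ..., x_k and weights w_1, ..., w_k
% (w_i > 0, \sum_i w_i = 1); \mathcal{I}(x, \theta) is the non-negative definite
% Fisher information matrix of a single observation at predictor x.
M(\xi, \theta) = \sum_{i=1}^{k} w_i \, \mathcal{I}(x_i, \theta)

% Familiar rank-one special case: a linear model with regression vector f(x)
% and unit error variance, where
\mathcal{I}(x, \theta) = f(x) f(x)^{\top}
```

In the rank-one case each summand is determined by the single vector $f(x)$. When $\mathcal{I}(x, \theta)$ has rank larger than $1$, the summands must be compared as matrices, which is where distances between matrices become useful.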

Abstract

We propose a novel two-stage subsampling algorithm based on optimal design principles. In the first stage, we use a density-based clustering algorithm to identify an approximating design space for the predictors from an initial subsample. Next, we determine an optimal approximate design on this design space. Finally, we use matrix distances such as the Procrustes, Frobenius, and square-root distance to define the remaining subsample, such that its points are "closest" to the support points of the optimal design. Our approach reflects the specific nature of the information matrix as a weighted sum of non-negative definite Fisher information matrices evaluated at the design points and applies to a large class of regression models including models where the Fisher information is of rank larger than $1$.
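
The three matrix distances named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Procrustes distance is taken to be the size-and-shape version (minimum over orthogonal rotations of the difference between matrix square roots), and the helpers `closest_points` and `linear_info` are hypothetical names introduced only to show how "closest to a support point" might be evaluated through Fisher information matrices.

```python
import numpy as np

def psd_sqrt(a):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frobenius_distance(a, b):
    """Frobenius norm of the difference A - B."""
    return float(np.linalg.norm(a - b, ord="fro"))

def square_root_distance(a, b):
    """Frobenius norm of the difference of the symmetric square roots."""
    return float(np.linalg.norm(psd_sqrt(a) - psd_sqrt(b), ord="fro"))

def procrustes_distance(a, b):
    """Assumed size-and-shape form: min over orthogonal R of ||L_a - L_b R||_F."""
    la, lb = psd_sqrt(a), psd_sqrt(b)
    # The minimum has a closed form using the sum of singular values of L_b^T L_a.
    nuclear = np.linalg.svd(lb.T @ la, compute_uv=False).sum()
    d2 = np.trace(a) + np.trace(b) - 2.0 * nuclear
    return float(np.sqrt(max(d2, 0.0)))

def closest_points(x, support_point, info, distance, k):
    """Hypothetical helper: indices of the k rows of x whose information
    matrices are nearest to the information matrix at support_point."""
    target = info(support_point)
    d = np.array([distance(info(row), target) for row in x])
    return np.argsort(d)[:k]

def linear_info(x):
    """Illustrative rank-one information f(x) f(x)^T with f(x) = (1, x)."""
    f = np.concatenate(([1.0], np.atleast_1d(x)))
    return np.outer(f, f)

if __name__ == "__main__":
    a = np.diag([4.0, 1.0])
    b = np.eye(2)
    print(frobenius_distance(a, b))
    print(square_root_distance(a, b))
    print(procrustes_distance(a, b))
    x = np.array([[0.0], [0.9], [2.0]])
    print(closest_points(x, np.array([1.0]), linear_info, frobenius_distance, k=1))
```

For commuting matrices such as the two diagonal matrices in the usage example, the Procrustes and square-root distances coincide; they generally differ otherwise. Which distance works best for a given model is a question the sketch does not address.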

Taxonomy

Topics: Gene expression and cancer classification · Statistical Methods and Inference