A Statistical View of Column Subset Selection

Anav Sood; Trevor Hastie

arXiv:2307.12892 · stat.ME · May 20, 2025 · 2 cites

TL;DR

This paper unifies the computer science and statistical perspectives on column subset selection, showing their equivalence and providing methods for efficient, robust subset selection in high-dimensional data.

Contribution

It demonstrates the equivalence of CSS and principal variables, frames both as maximum likelihood estimation, and develops practical methods for high-dimensional, incomplete, or censored data.

Findings

01. CSS and principal variables are equivalent approaches.
02. Efficient CSS using only summary statistics is possible.
03. CSS methods are robust to missing and censored data.

Abstract

We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the number of variables over the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in…
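As a rough illustration of item (1), the sketch below selects columns using only a covariance matrix. It relies on the standard identity that, for centered data X with n rows and Σ = XᵀX / n, the CSS reconstruction error for a subset S equals n · tr(Σ − Σ_{·S} Σ_S⁻¹ Σ_{S·}), so the raw data are not needed. The greedy forward search and the function name `greedy_css_from_cov` are assumptions made for this example; this is not the paper's algorithm or the API of the `anavsood/css` repository.

```python
import numpy as np


def greedy_css_from_cov(sigma, k, tol=1e-12):
    """Greedily pick up to k column indices from a covariance matrix alone.

    Hypothetical helper for illustration only. Returns the selected indices
    and the remaining (unexplained) total variance tr(residual covariance).
    """
    # Residual covariance of all variables given the selected ones.
    resid = np.array(sigma, dtype=float)
    selected = []
    for _ in range(k):
        diag = np.diag(resid).copy()
        # Candidates: not yet selected and with non-negligible residual variance.
        ok = diag > tol
        ok[np.array(selected, dtype=int)] = False
        if not ok.any():
            break  # nothing left to explain
        # Drop in total residual variance if column j were added:
        # sum_i resid[i, j]^2 / resid[j, j].
        gain = np.full(diag.shape, -np.inf)
        gain[ok] = np.sum(resid[:, ok] ** 2, axis=0) / diag[ok]
        j = int(np.argmax(gain))
        selected.append(j)
        # Rank-one Schur complement update after conditioning on column j.
        resid = resid - np.outer(resid[:, j], resid[j, :]) / resid[j, j]
    return selected, float(np.trace(resid))
```

Calling `greedy_css_from_cov(sigma, k)` with a sample covariance matrix `sigma` returns the chosen indices and the leftover variance; multiplying the latter by n recovers the squared Frobenius reconstruction error on the centered data. A greedy search is only a heuristic and need not find the optimal subset.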

Code & Models

Repositories

anavsood/css · Official

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Machine Learning and Data Classification · Data Mining Algorithms and Applications