Correlated variables in regression: clustering and sparse estimation
Peter Bühlmann, Philipp Rütimann, Sara van de Geer, Cun-Hui Zhang

TL;DR
This paper introduces a novel clustering method based on canonical correlations for high-dimensional regression with correlated variables, improving estimation accuracy and variable detection.
Contribution
It proposes a new agglomerative clustering algorithm and demonstrates its statistical consistency and benefits for sparse estimation in correlated variable settings.
Findings
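The clustering algorithm finds an optimal solution and is statistically consistent.
Theoretical arguments indicate that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix, which ensures identifiability and an oracle inequality for the group Lasso.
Under certain circumstances, running the Lasso on cluster representatives improves prediction and the detection of variables.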
Abstract
We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and do subsequent sparse estimation such as the Lasso for cluster-representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel and bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where cluster-representatives and using the Lasso as subsequent estimator leads to improved results for prediction and detection of variables.…
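The two-step idea in the abstract (cluster the variables, then run a sparse estimator on cluster representatives) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the stopping threshold `tau`, the use of cluster means as representatives, the plain coordinate-descent Lasso, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def max_canonical_correlation(XA, XB):
    """Largest sample canonical correlation between two column sets.
    Assumes centered columns have full rank and fewer columns than rows."""
    QA, _ = np.linalg.qr(XA - XA.mean(axis=0))
    QB, _ = np.linalg.qr(XB - XB.mean(axis=0))
    s = np.linalg.svd(QA.T @ QB, compute_uv=False)
    return float(min(s[0], 1.0))

def cluster_by_canonical_correlation(X, tau):
    """Bottom-up merging: repeatedly merge the two clusters with the largest
    canonical correlation; stop once that maximum is at most tau (assumed rule)."""
    clusters = [[j] for j in range(X.shape[1])]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                rho = max_canonical_correlation(X[:, clusters[a]], X[:, clusters[b]])
                if rho > best:
                    best, pair = rho, (a, b)
        if best <= tau:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n))*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j excluded).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

# Synthetic example (assumed setup): two blocks of strongly correlated
# variables plus two independent ones; only the first block drives y.
rng = np.random.default_rng(0)
n = 200
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
block1 = z1[:, None] + 0.05 * rng.standard_normal((n, 3))
block2 = z2[:, None] + 0.05 * rng.standard_normal((n, 3))
X = np.hstack([block1, block2, rng.standard_normal((n, 2))])
y = 2.0 * z1 + 0.5 * rng.standard_normal(n)

clusters = cluster_by_canonical_correlation(X, tau=0.5)
# Cluster representatives: the mean of the variables in each cluster.
Z = np.column_stack([X[:, c].mean(axis=1) for c in clusters])
Z = Z - Z.mean(axis=0)
beta_hat = lasso_cd(Z, y - y.mean(), lam=0.2)
```

Here `clusters` holds the column indices grouped by the merging rule and `beta_hat` holds one Lasso coefficient per cluster representative. The group Lasso variant mentioned in the abstract would instead keep all original columns and penalize them groupwise using `clusters`; it is not shown here.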
