A Note on Coding and Standardization of Categorical Variables in (Sparse) Group Lasso Regression
Felicitas J. Detmer, Martin Slawski

TL;DR
This paper investigates the role of standardization in group lasso regression with categorical variables. It shows that simple column-wise scaling of the design matrix suffices in place of block-wise orthonormalization, which simplifies implementation while retaining the performance benefit of standardization.
Contribution
The paper demonstrates that, for categorical predictors in the group lasso, column-wise scaling of the design matrix has the same effect as block-wise orthonormalization, which simplifies the standardization procedure.
Findings
Column scaling achieves the same effect as orthonormalization.
Proper standardization can substantially improve model performance.
Extensions to sparse group lasso are also discussed.
Abstract
Categorical regressor variables are usually handled by introducing a set of indicator variables, and imposing a linear constraint to ensure identifiability in the presence of an intercept, or equivalently, using one of various coding schemes. As proposed in Yuan and Lin [J. R. Statist. Soc. B, 68 (2006), 49-67], the group lasso is a natural and computationally convenient approach to perform variable selection in settings with categorical covariates. As pointed out by Simon and Tibshirani [Stat. Sin., 22 (2011), 983-1001], "standardization" by means of block-wise orthonormalization of column submatrices each corresponding to one group of variables can substantially boost performance. In this note, we study the aspect of standardization for the special case of categorical predictors in detail. The main result is that orthonormalization is not required; column-wise scaling of the design…
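As a rough illustration of why column scaling can stand in for orthonormalization, consider the simplest case: a single categorical variable coded with one indicator column per level. Each observation falls in exactly one level, so the indicator columns are mutually orthogonal and dividing each column by its norm (the square root of the level count) already yields an orthonormal block. The sketch below checks this numerically with NumPy. The function names `indicator_matrix` and `scale_columns` and the toy data are assumptions made for illustration; this is not the paper's full procedure, which also addresses the intercept, the coding scheme, and the handling of the coefficients.

```python
import numpy as np


def indicator_matrix(levels, n_levels):
    """Build an n x n_levels matrix with one indicator column per level.

    `levels` holds integer level codes in {0, ..., n_levels - 1}.
    (Hypothetical helper, not taken from the paper.)
    """
    Z = np.zeros((len(levels), n_levels))
    Z[np.arange(len(levels)), levels] = 1.0
    return Z


def scale_columns(Z):
    """Divide each column by its Euclidean norm and return both results.

    For an indicator column the norm is sqrt(number of observations
    in that level). (Hypothetical helper, not taken from the paper.)
    """
    norms = np.sqrt((Z ** 2).sum(axis=0))
    return Z / norms, norms


# Toy data: 10 observations of a 3-level categorical variable.
levels = np.array([0, 1, 1, 2, 0, 2, 2, 1, 0, 2])
Z = indicator_matrix(levels, 3)

# Column-wise scaling only.
Z_scaled, norms = scale_columns(Z)

# Gram matrix of the scaled block; the identity means the block is orthonormal.
gram = Z_scaled.T @ Z_scaled

# Generic orthonormalization via QR for comparison. Because the columns of Z
# are already orthogonal, Q matches Z_scaled up to the sign of each column.
Q, _ = np.linalg.qr(Z)
same_up_to_sign = np.allclose(np.abs(Q), Z_scaled)
```

In this uncentered indicator coding, `gram` is the identity matrix, so a generic orthonormalization step has nothing left to do beyond the scaling.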
