Data-Driven Subgroup Identification for Linear Regression
Zachary Izzo, Ruishan Liu, James Zou

TL;DR
This paper introduces DDGroup, a data-driven method for identifying subgroups within data where linear models are valid, improving interpretability and performance in heterogeneous datasets like medical studies.
Contribution
The paper presents DDGroup, a novel, interpretable, and computationally efficient approach for discovering subgroups with homogeneous linear relationships in data.
Findings
DDGroup accurately recovers regions with low-variance linear models given sufficient data.
It improves local linear model performance on real-world medical datasets.
It uncovers subgroups with different relationships missed by global parametric models.
Abstract
Medical studies frequently require extracting the relationship between each covariate and the outcome with statistical confidence measures. To do this, simple parametric models are frequently used (e.g., coefficients of linear regression), but they are usually fitted on the whole dataset. However, covariates often do not have a uniform effect over the whole population, so a single simple model can miss the heterogeneous signal. For example, a linear model may be able to explain a subset of the data but fail on the rest due to the nonlinearity and heterogeneity in the data. In this paper, we propose DDGroup (data-driven group discovery), a data-driven method to effectively identify subgroups in the data with a uniform linear relationship between the features and the label. DDGroup outputs an interpretable region in which the linear model is expected to hold. It is simple to…
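The motivating situation in the abstract, where a linear model explains one subset of the data but fails on the rest, can be illustrated with a small synthetic sketch. This is not the DDGroup algorithm: the region (`x0 >= 0`) is given by hand rather than discovered, and the data-generating process, the helper `fit_linear`, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, in-sample MSE)."""
    A = np.column_stack([X, np.ones(len(X))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = float(np.mean((A @ coef - y) ** 2))
    return coef, mse

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-1.0, 1.0, size=(n, 2))
noise = 0.05 * rng.standard_normal(n)

# Hypothetical outcome: linear in the region x0 >= 0, nonlinear elsewhere.
in_region = X[:, 0] >= 0.0
y_linear = 2.0 * X[:, 0] - X[:, 1]
y_nonlinear = np.sin(6.0 * X[:, 0]) + X[:, 1] ** 2
y = np.where(in_region, y_linear, y_nonlinear) + noise

# One model for the whole population vs. one model restricted to the region.
coef_global, mse_global = fit_linear(X, y)
coef_region, mse_region = fit_linear(X[in_region], y[in_region])

print(f"global fit MSE: {mse_global:.4f}")
print(f"region fit MSE: {mse_region:.4f}")
print("region coefficients (x0, x1, intercept):", np.round(coef_region, 3))
```

In this toy setup the fit restricted to the region has a much lower error than the global fit and recovers coefficients close to the ones used to generate the data. Per the abstract, what DDGroup adds is finding such an interpretable region from the data, which this sketch does not attempt.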
Taxonomy
Topics: Advanced Clustering Algorithms Research · Biomedical Text Mining and Ontologies · Gene expression and cancer classification
