Diffusion-Driven High-Dimensional Variable Selection
Minjie Wang, Xiaotong Shen, Wei Pan

TL;DR
This paper introduces a diffusion model-based resampling framework for stable and reliable variable selection in high-dimensional, correlated data, improving accuracy and interpretability over existing methods.
Contribution
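The paper proposes a diffusion-driven approach that aggregates selections across synthetic data sets to stabilize variable selection. It works with any off-the-shelf selector, benefits from transfer learning through pre-trained generative weights, and is shown to be selection consistent under mild assumptions.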
Findings
Outperforms lasso, stability selection, and knockoffs in simulations.
Achieves higher true-positive rates and lower false discoveries.
Provides valid confidence intervals and hypothesis tests.
Abstract
Variable selection for high-dimensional, highly correlated data has long been a challenging problem, often yielding unstable and unreliable models. We propose a resample-aggregate framework that exploits diffusion models' ability to generate high-fidelity synthetic data. Specifically, we draw multiple pseudo-data sets from a diffusion model fitted to the original data, apply any off-the-shelf selector (e.g., lasso or SCAD), and store the resulting inclusion indicators and coefficients. Aggregating across replicas produces a stable subset of predictors with calibrated stability scores for variable selection. Theoretically, we show that the proposed method is selection consistent under mild assumptions. Because the generative model imports knowledge from large pre-trained weights, the procedure naturally benefits from transfer learning, boosting power when the observed sample is small or…
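The resample-aggregate loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_sampler` (bootstrap rows plus small jitter) is a hypothetical stand-in for drawing a pseudo-data set from a fitted diffusion model, `toy_selector` (marginal-correlation screening) is a stand-in for lasso or SCAD, and the 0.5 threshold and all function names are chosen here for illustration. Only inclusion indicators are aggregated; coefficients are omitted for brevity.

```python
import random
from typing import Callable, List, Tuple

Dataset = Tuple[List[List[float]], List[float]]  # (rows of X, responses y)


def stability_scores(sample_synthetic: Callable[[], Dataset],
                     selector: Callable[[List[List[float]], List[float]], List[int]],
                     n_features: int, n_replicas: int) -> List[float]:
    """Fraction of synthetic replicas in which each feature is selected."""
    counts = [0] * n_features
    for _ in range(n_replicas):
        X, y = sample_synthetic()          # one pseudo-data set
        for j in set(selector(X, y)):      # inclusion indicators for this replica
            counts[j] += 1
    return [c / n_replicas for c in counts]


def select_stable(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Keep features whose stability score reaches the (assumed) threshold."""
    return [j for j, s in enumerate(scores) if s >= threshold]


def pearson(a: List[float], b: List[float]) -> float:
    """Sample Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def toy_selector(X: List[List[float]], y: List[float], cutoff: float = 0.3) -> List[int]:
    """Stand-in selector (assumption): marginal-correlation screening, not lasso/SCAD."""
    p = len(X[0])
    return [j for j in range(p) if abs(pearson([row[j] for row in X], y)) > cutoff]


# Toy "observed" data: only features 0 and 1 carry signal.
rng = random.Random(0)
P, N = 5, 200
X_obs = [[rng.gauss(0, 1) for _ in range(P)] for _ in range(N)]
y_obs = [2.0 * r[0] - 1.5 * r[1] + rng.gauss(0, 1) for r in X_obs]


def toy_sampler() -> Dataset:
    """Stand-in generator (assumption): bootstrap rows with jitter, not a diffusion model."""
    idx = [rng.randrange(N) for _ in range(N)]
    X = [[v + rng.gauss(0, 0.05) for v in X_obs[i]] for i in idx]
    y = [y_obs[i] + rng.gauss(0, 0.05) for i in idx]
    return X, y


scores = stability_scores(toy_sampler, toy_selector, P, n_replicas=50)
selected = select_stable(scores)
print("stability scores:", scores)
print("selected features:", selected)
```

Because the sampler and selector are passed in as callables, either can be swapped for a real generative model or a penalized-regression routine without changing the aggregation step.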
