Statistical Test for Feature Selection Pipelines by Selective Inference
Tomohiro Shiraishi, Tatsuya Matsukawa, Shuichi Nishino, Ichiro Takeuchi

TL;DR
This paper introduces a new statistical test based on selective inference to evaluate the significance of feature selection pipelines, ensuring valid false positive control in data analysis workflows.
Contribution
It develops a general framework for statistically testing feature selection pipelines, applicable to various components, with theoretical guarantees and practical implementation.
Findings
The test controls false positive rates at any desired level.
Experiments on synthetic and real data confirm the test's validity and effectiveness.
The framework allows any configuration of the pipeline components to be tested without additional implementation cost.
Abstract
A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating various analysis algorithms. In this paper, we propose a novel statistical test to assess the significance of data analysis pipelines in feature selection problems. Our approach enables the systematic development of valid statistical tests applicable to any feature selection pipeline composed of predefined components. We develop this framework based on selective inference, a statistical technique that has recently gained attention for data-driven hypotheses. As a proof of concept, we consider feature selection pipelines for linear models, composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We theoretically prove that our statistical test can control the probability of false positive…
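To make the kind of pipeline described in the abstract concrete, here is a minimal sketch of one imputation step, one outlier-removal step, and one feature selection step chained together for a linear model. The abstract does not name the specific algorithms, so mean imputation, residual-based outlier removal, and marginal screening are stand-ins chosen for illustration, and every function name and threshold below is hypothetical. The sketch only runs the pipeline; it does not implement the proposed test.

```python
import numpy as np

def impute_mean(X, y):
    # Assumed imputation step: replace missing responses (NaN) with the
    # mean of the observed responses.
    y = y.copy()
    y[np.isnan(y)] = np.nanmean(y)
    return X, y

def drop_outliers(X, y, thresh=3.0):
    # Assumed outlier step: fit least squares, then drop rows whose absolute
    # residual exceeds thresh times a robust scale estimate (1.4826 * MAD).
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    keep = np.abs(resid) <= thresh * scale
    return X[keep], y[keep]

def marginal_screening(X, y, k=2):
    # Assumed selection step: keep the k features with the largest
    # absolute inner product with the response.
    scores = np.abs(X.T @ y)
    return np.sort(np.argsort(scores)[-k:])

def run_pipeline(X, y, k=2):
    # Chain the three components; the output is a data-dependent feature set.
    X, y = impute_mean(X, y)
    X, y = drop_outliers(X, y)
    return marginal_screening(X, y, k=k)

# Synthetic data: features 0 and 3 carry signal, the rest are noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(100)
y[:5] = np.nan   # simulate missing responses
y[5] += 50.0     # simulate one gross outlier

selected = run_pipeline(X, y, k=2)
print("selected feature indices:", selected)
```

Because the selected set depends on the same data that would later be used to test it, ordinary p-values computed on those features are not valid as they stand. That data-driven choice of hypotheses is the situation selective inference is designed to handle.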
Taxonomy
Topics: Advanced Statistical Methods and Models · Fault Detection and Control Systems
Methods: Softmax · Attention Is All You Need · Feature Selection
