Massively-Parallel Feature Selection for Big Data
Ioannis Tsamardinos, Giorgos Borboudakis, Pavlos Katsogridakis, Polyvios Pratikakis, and Vassilis Christophides

TL;DR
The paper introduces PFBP, a scalable parallel feature selection algorithm for Big Data that efficiently handles high dimensionality by partitioning data and using local computations with minimal communication, ensuring soundness and scalability.
Contribution
It presents a novel parallel feature selection method that combines data partitioning, local independence testing, and heuristics for early decision-making, with theoretical guarantees and empirical validation.
Findings
Super-linear speedup with increasing sample size
Linear scalability with features and cores
Outperforms competing algorithms in efficiency
Abstract
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) as well as columns (features). By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. Then, it employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions…
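The idea of combining partition-local p-values through meta-analysis can be illustrated with a small sketch. The abstract does not say which meta-analysis technique is used, so the choice of Fisher's combined probability test here is an assumption made for illustration, and the function name `fisher_combine` is hypothetical. Each sample partition is assumed to have already produced its own p-value for the same conditional independence test; only those scalars would need to be communicated.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (illustrative sketch).

    Assumption: each entry is the p-value of the same independence test,
    computed locally on a disjoint partition of the samples.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    # Fisher's statistic is X = -2 * sum(log p_i); under the null hypothesis
    # it follows a chi-square distribution with 2k degrees of freedom.
    # We work with X / 2 directly; p-values are floored to avoid log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in p_values)
    # For an even number of degrees of freedom (2k) the chi-square survival
    # function has a closed form: exp(-X/2) * sum_{i<k} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)

# Hypothetical local p-values from three sample partitions.
local_p = [0.01, 0.02, 0.03]
combined_p = fisher_combine(local_p)
```

With a single partition the combined value equals the local p-value, and several moderately small local p-values combine into a smaller one, which is what lets a decision be reached from local computations alone.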
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Machine Learning and Data Classification · Face and Expression Recognition
Methods: Pruning · Early Stopping
