Scalable and Accurate Online Feature Selection for Big Data
Kui Yu, Xindong Wu, Wei Ding, and Jian Pei

TL;DR
This paper introduces SAOLA and group-SAOLA, online feature selection algorithms for big data whose dimensionality can reach the millions. In empirical evaluations they scale to such data and compare favorably with existing feature selection methods.
Contribution
The paper proposes novel online feature selection algorithms, SAOLA and group-SAOLA, capable of handling extremely high-dimensional data efficiently and maintaining sparse feature sets over time.
Findings
SAOLA and group-SAOLA are scalable to very high-dimensional data.
They outperform existing feature selection methods in accuracy and efficiency.
The algorithms effectively maintain sparse feature sets both at group and individual feature levels.
Abstract
Feature selection is important in many big data applications. Two critical challenges are closely associated with big data. First, in many big data applications the dimensionality is extremely high, in the millions, and keeps growing. Second, big data applications call for highly scalable feature selection algorithms that work in an online manner, so that each feature can be processed in a sequential scan. In this paper we present SAOLA, a Scalable and Accurate OnLine Approach for feature selection. Building on a theoretical analysis of bounds on the pairwise correlations between features, SAOLA employs novel pairwise comparison techniques and maintains a parsimonious model over time in an online manner. Furthermore, to deal with upcoming features that arrive in groups, we extend the SAOLA algorithm and propose a new group-SAOLA algorithm for online group feature selection. The group-SAOLA algorithm…
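To make the idea of online selection with pairwise comparisons concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact rule or correlation measure, so the sketch assumes absolute Pearson correlation as the measure, a hypothetical relevance threshold `delta`, and a simple redundancy rule (drop whichever of two features is less relevant to the target when their mutual correlation is at least that weaker relevance). The names `abs_corr` and `OnlineSelector` are illustrative only.

```python
import math
from typing import Dict, List, Sequence


def abs_corr(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


class OnlineSelector:
    """Keeps a small feature set while features arrive one at a time.

    Assumption: `delta` is a hypothetical relevance threshold, not a value
    taken from the paper.
    """

    def __init__(self, target: Sequence[float], delta: float = 0.1) -> None:
        self.target = list(target)
        self.delta = delta
        self.selected: Dict[str, List[float]] = {}   # name -> feature column
        self.relevance: Dict[str, float] = {}        # name -> corr with target

    def offer(self, name: str, values: Sequence[float]) -> bool:
        """Process one arriving feature; return True if it is kept."""
        column = list(values)
        rel = abs_corr(column, self.target)
        if rel <= self.delta:
            return False  # not relevant enough to the target
        # Pairwise comparison against each currently kept feature.
        for other in list(self.selected):
            pair = abs_corr(column, self.selected[other])
            other_rel = self.relevance[other]
            if other_rel >= rel and pair >= rel:
                return False  # new feature is redundant given a kept one
            if rel > other_rel and pair >= other_rel:
                # Kept feature is redundant given the new, stronger one.
                del self.selected[other]
                del self.relevance[other]
        self.selected[name] = column
        self.relevance[name] = rel
        return True
```

Each call to `offer` touches only the arriving feature and the currently kept set, which is what allows a single sequential scan over the features; the kept set stays small because redundant features are dropped in either direction.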
Taxonomy
Topics: Face and Expression Recognition · Gene expression and cancer classification · Machine Learning and Data Classification
