Greedy Biomarker Discovery in the Genome with Applications to Antimicrobial Resistance
Alexandre Drouin, Sébastien Giguère, Maxime Déraspe, François Laviolette, Mario Marchand, Jacques Corbeil

TL;DR
This paper extends the Set Covering Machine (SCM) algorithm to extremely high-dimensional genomic data and shows that, when predicting antimicrobial resistance, it compares favorably in sparsity and accuracy with L1- and L2-regularized Support Vector Machines and CART decision trees.
Contribution
The paper introduces an extension of the SCM for large-scale genomic datasets, enabling direct analysis without feature filtering, and evaluates its performance on antimicrobial resistance prediction.
Findings
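SCM compares favorably with L1- and L2-regularized SVMs and CART decision trees in sparsity and accuracy
SCM was the only algorithm evaluated that could consider the full feature space; the others required feature filtering as a preprocessing step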
SCM is computationally feasible for datasets with over 10^7 features
Abstract
The Set Covering Machine (SCM) is a greedy learning algorithm that produces sparse classifiers. We extend the SCM for datasets that contain a huge number of features. The whole genetic material of living organisms is an example of such a case, where the number of features exceeds 10^7. Three human pathogens were used to evaluate the performance of the SCM at predicting antimicrobial resistance. Our results show that the SCM compares favorably in terms of sparsity and accuracy against L1- and L2-regularized Support Vector Machines and CART decision trees. Moreover, the SCM was the only algorithm that could consider the full feature space. For all other algorithms, the latter had to be filtered as a preprocessing step.
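To make the "greedy" and "sparse" parts of the abstract concrete, here is a minimal sketch of the classic SCM training loop for a conjunction of rules. It is not the paper's large-scale extension, and several details are assumptions made for illustration: features are binary (for example, presence or absence of a genomic marker), each rule requires one feature to take one value, and the names `train_scm`, `predict`, the trade-off parameter `p`, and `max_rules` are hypothetical. At each step the rule with the highest utility (negatives covered minus `p` times positives lost) is added to the model.

```python
def train_scm(X, y, p=1.0, max_rules=3):
    """Greedily build a conjunction of (feature, required_value) rules.

    X: list of rows of 0/1 features; y: list of 0/1 labels.
    Illustrative only: this naive loop rescans every feature at each
    step and would not scale to 10^7 features as written.
    """
    n_features = len(X[0])
    # Candidate rules (assumption): "feature j is present" or "feature j is absent".
    candidates = [(j, v) for j in range(n_features) for v in (1, 0)]
    neg = {i for i, label in enumerate(y) if label == 0}  # negatives still to cover
    pos = {i for i, label in enumerate(y) if label == 1}  # positives still accepted
    model = []
    while neg and len(model) < max_rules:
        best_rule, best_utility, best_covered = None, float("-inf"), set()
        for j, v in candidates:
            # A rule "covers" a negative example when it rejects it.
            covered = {i for i in neg if X[i][j] != v}
            # Positives rejected by the rule become errors of the conjunction.
            lost = sum(1 for i in pos if X[i][j] != v)
            utility = len(covered) - p * lost
            if utility > best_utility:
                best_rule, best_utility, best_covered = (j, v), utility, covered
        if not best_covered:
            break  # no remaining rule covers any negative example
        j, v = best_rule
        model.append(best_rule)
        neg -= best_covered
        pos = {i for i in pos if X[i][j] == v}
    return model


def predict(model, x):
    """A conjunction predicts 1 only if every selected rule is satisfied."""
    return int(all(x[j] == v for j, v in model))


# Toy data (made up): feature 0 alone separates the two classes.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
model = train_scm(X, y)
predictions = [predict(model, x) for x in X]
```

On this toy input the learned model contains a single rule, which illustrates why SCM classifiers tend to be sparse: training stops as soon as every negative example is covered.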
Taxonomy
Topics: Machine Learning in Bioinformatics · Gene expression and cancer classification · Text and Document Classification Technologies
