Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics
Tamsin James, Ben Williamson, Peter Tino, Nicole Wheeler

TL;DR
This paper discusses the challenges and open problems in using machine learning to predict bacterial phenotypes from whole-genome data, emphasizing how hard it is to identify true causal genetic factors when the data are high-dimensional and spurious correlations are common.
Contribution
It highlights the limitations of current ML approaches in reliably discovering causal genetic variants in bacterial genomics and outlines open research problems in the field.
Findings
High accuracy in phenotype prediction does not imply causal understanding.
Current models often identify false associations as causal features.
Open problems are identified for improving causal inference in bacterial genomics.
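The first two findings can be illustrated with a small synthetic sketch. It is not the authors' method or data: the simulation design, the function names (`simulate`, `abs_correlations`, `passenger_vote_accuracy`), and every parameter value are assumptions chosen for illustration. One variant causes the phenotype, while a handful of "passenger" variants merely share its lineage. A predictor that never sees the causal variant can still score well, and the passengers stand out as strongly associated features.

```python
import numpy as np


def simulate(n_genomes=500, n_variants=2000, n_linked=20, seed=0):
    """Toy presence/absence matrix (hypothetical setup, not real genomes).

    Column 0 is the only causal variant. Columns 1..n_linked are passenger
    variants that track the same lineage. All other columns are random noise.
    """
    rng = np.random.default_rng(seed)
    lineage = rng.integers(0, 2, size=n_genomes)  # two clades
    X = rng.integers(0, 2, size=(n_genomes, n_variants))

    # The causal variant agrees with lineage membership 90% of the time.
    causal = np.where(rng.random(n_genomes) < 0.90, lineage, 1 - lineage)
    X[:, 0] = causal

    # Passengers agree with lineage 95% of the time but have no effect.
    for j in range(1, n_linked + 1):
        X[:, j] = np.where(rng.random(n_genomes) < 0.95, lineage, 1 - lineage)

    # Phenotype equals the causal variant, with 5% of labels flipped.
    flips = rng.random(n_genomes) < 0.05
    y = np.where(flips, 1 - causal, causal)
    return X, y


def abs_correlations(X, y):
    """Absolute Pearson correlation of every variant with the phenotype."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return np.abs(Xc.T @ yc) / denom


def passenger_vote_accuracy(X, y, n_linked=20):
    """Accuracy of a majority vote over passenger variants only."""
    votes = X[:, 1:n_linked + 1].mean(axis=1) > 0.5
    return float((votes.astype(int) == y).mean())


if __name__ == "__main__":
    X, y = simulate()
    r = abs_correlations(X, y)
    print("accuracy without the causal variant:", passenger_vote_accuracy(X, y))
    print("non-causal variants with |r| > 0.4:", int((r[1:] > 0.4).sum()))
```

In this toy setting the association signal comes from shared ancestry rather than mechanism, so neither the accuracy score nor the feature ranking says which variant is causal.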
Abstract
How can we identify causal genetic mechanisms that govern bacterial traits? Initial efforts entrusting machine learning models to handle the task of predicting phenotype from genotype return high accuracy scores. However, attempts to extract any meaning from the predictive models are found to be corrupted by falsely identified "causal" features. Relying solely on pattern recognition and correlations is unreliable, significantly so in bacterial genomics settings where high-dimensionality and spurious associations are the norm. Though it is not yet clear whether we can overcome this hurdle, significant efforts are being made towards discovering potential high-risk bacterial genetic variants. In view of this, we set up open problems surrounding phenotype prediction from bacterial whole-genome datasets and extending those to learning causal effects, and discuss challenges that impact the…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Genetics, Bioinformatics, and Biomedical Research
Methods: Sparse Evolutionary Training
