TL;DR
This paper introduces FREEFORM, a knowledge-driven framework leveraging large language models for feature selection and engineering in genotype data, improving phenotype prediction especially in low-data scenarios.
Contribution
The paper presents a novel framework that uses pre-trained LLMs for knowledge-driven feature selection and engineering in genotype data, outperforming traditional data-driven methods.
Findings
Outperforms data-driven methods on two genotype-phenotype prediction tasks: genetic ancestry and hereditary hearing loss
Effective in low-shot regimes with limited data
Open-source implementation available on GitHub
Abstract
Predicting phenotypes with complex genetic bases based on a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are utilized for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we…
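The ensembling principle named in the abstract can be illustrated with a minimal sketch. This is not the FREEFORM implementation: it assumes that each LLM run returns a list of selected variant names, and it aggregates the runs by simple vote counting. The function name `ensemble_select`, the variant names, and the tie-breaking rule are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def ensemble_select(runs, k):
    """Return the k features selected most often across runs.

    `runs` is a list of feature-name lists, one per (hypothetical) LLM
    selection run. Ties are broken alphabetically so the output is
    deterministic; this rule is an assumption, not taken from the paper.
    """
    votes = Counter()
    for selected in runs:
        # Count each feature at most once per run.
        votes.update(set(selected))
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


# Made-up variant names standing in for the output of three selection runs.
runs = [
    ["snp_a", "snp_b", "snp_c"],
    ["snp_b", "snp_c", "snp_d"],
    ["snp_c", "snp_b", "snp_e"],
]
top = ensemble_select(runs, 2)
print(top)
```

Voting across several independent runs is one common way to reduce the variance of a single stochastic LLM response; the actual aggregation scheme used by the authors may differ.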
Taxonomy
Methods: Sparse Evolutionary Training · Feature Selection
