Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

Joseph Lee; Shu Yang; Jae Young Baik; Xiaoxi Liu; Zhen Tan; Dawei Li; Zixuan Wen; Bojian Hou; Duy Duong-Tran; Tianlong Chen; Li Shen

arXiv:2410.01795·cs.LG·April 17, 2025

TL;DR

This paper introduces FREEFORM, a knowledge-driven framework leveraging large language models for feature selection and engineering in genotype data, improving phenotype prediction especially in low-data scenarios.

Contribution

The paper presents a novel framework that uses pre-trained LLMs for knowledge-driven feature selection and engineering in genotype data, outperforming traditional data-driven methods.

Findings

01 Outperforms data-driven methods on genotype-phenotype prediction tasks

02 Effective in low-shot regimes with limited data

03 Open-source implementation available on GitHub

Abstract

Predicting phenotypes with complex genetic bases based on a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are utilized for this task, yet the high-dimensional nature of genotype data makes the analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs in feature selection and engineering for tabular genotype data, with a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we…
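
The ensembling principle mentioned in the abstract can be illustrated with a small sketch. This is not the FREEFORM implementation (see the repository listed under Code & Models for that); it only shows one plausible way to aggregate several independent feature selections by vote count. The function names `ensemble_select` and `mock_propose`, the parameters `k` and `n_runs`, and the feature names are all hypothetical, and the proposer here is a deterministic stand-in for what would be an LLM call in practice.

```python
from collections import Counter
from typing import Callable, Sequence


def ensemble_select(
    propose: Callable[[Sequence[str], int], list],
    features: Sequence[str],
    k: int,
    n_runs: int = 5,
) -> list:
    """Aggregate several independent feature selections by vote count.

    `propose(features, run)` is a hypothetical callable that returns the
    subset of feature names chosen on one run (e.g. one LLM response).
    """
    allowed = set(features)
    votes = Counter()
    for run in range(n_runs):
        picked = propose(features, run)
        # Count each valid feature at most once per run, and ignore any
        # names that are not in the candidate list.
        votes.update(set(picked) & allowed)
    # Rank by votes (descending), breaking ties by original feature order.
    order = {name: i for i, name in enumerate(features)}
    ranked = sorted(votes, key=lambda f: (-votes[f], order[f]))
    return ranked[:k]


def mock_propose(features, run):
    # Hypothetical stand-in for an LLM call: always picks the first
    # feature, plus one other feature that rotates with the run index.
    return [features[0], features[(run % (len(features) - 1)) + 1]]


if __name__ == "__main__":
    candidates = ["snp_a", "snp_b", "snp_c", "snp_d"]
    print(ensemble_select(mock_propose, candidates, k=2))
```

Counting each feature once per run keeps a single verbose response from dominating the tally, and filtering against the candidate list guards against names that do not exist in the data.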

Code & Models

Repositories

pennshenlab/freeform
PyTorch · Official

Taxonomy

Methods: Sparse Evolutionary Training · Feature Selection