GP-GPT: Large Language Model for Gene-Phenotype Mapping
Yanjun Lyu, Zihao Wu, Lu Zhang, Jing Zhang, Yiwei Li, Wei Ruan, Zhengliang Liu, Zeyu Zhang, Xiang Li, Rongjie Liu, Chao Huang, Wentao Li, Tianming Liu, Dajiang Zhu

TL;DR
GP-GPT is a specialized large language model designed for gene-phenotype mapping, demonstrating superior performance in genomics relation analysis and information retrieval compared to general LLMs.
Contribution
The paper introduces GP-GPT, the first LLM tailored to genetic-phenotype knowledge representation. It is trained on an extensive genomics corpus and outperforms existing models on domain-specific tasks.
Findings
GP-GPT outperforms Llama2, Llama3, and GPT-4 in genomics tasks.
It accurately retrieves medical genetics information.
The model's subtle representations of bio-factor entities suggest new research opportunities.
Abstract
Pre-trained large language models (LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-source genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical fields. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Machine Learning in Healthcare · Genetics, Bioinformatics, and Biomedical Research
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Dropout
