Geneverse: A collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research

Tianyu Liu; Yijia Xiao; Xiao Luo; Hua Xu; W. Jim Zheng; Hongyu Zhao

arXiv:2406.15534 · cs.LG · September 25, 2024 · 1 citation

TL;DR

Geneverse introduces open-source multimodal large language models tailored for genomic and proteomic research, enabling novel biomedical applications with improved performance and accessibility.

Contribution

We developed and evaluated a collection of fine-tuned and multimodal LLMs specifically for genomics and proteomics, addressing a research gap in open-source biomedical AI tools.

Findings

01. Models perform well on gene and protein function tasks

02. Outperform some closed-source models in truthfulness and correctness

03. All training resources are freely accessible

Abstract

The applications of large language models (LLMs) are promising for biomedical and healthcare research. Despite the availability of open-source LLMs trained using a wide range of biomedical data, current research on the applications of LLMs to genomics and proteomics is still limited. To fill this gap, we propose a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse, for three novel tasks in genomic and proteomic research. The models in Geneverse are trained and evaluated based on domain-specific datasets, and we use advanced parameter-efficient finetuning techniques to achieve the model adaptation for tasks including the generation of descriptions for gene functions, protein function inference from its structure, and marker gene selection from spatial transcriptomic data. We demonstrate that adapted LLMs and MLLMs perform well for these tasks and may outperform…
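The abstract mentions parameter-efficient finetuning without naming a specific technique. As a rough illustration of the general idea, the sketch below assumes low-rank adaptation (LoRA), one widely used method of this kind: the pretrained weight matrix stays frozen and only two small low-rank matrices are trained. The layer sizes, rank, and helper names are hypothetical and are not taken from the Geneverse code.

```python
# Illustrative only: LoRA is assumed here as one example of parameter-efficient
# finetuning; the abstract does not say which technique Geneverse uses.
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, scale):
    """Compute x @ W + scale * (x @ A) @ B, where W is frozen and only A, B are trainable."""
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(brow, urow)] for brow, urow in zip(base, update)]


def trainable_params(d_in, d_out, rank):
    """Return (full finetuning, LoRA) trainable parameter counts for one weight matrix."""
    return d_in * d_out, rank * (d_in + d_out)


random.seed(0)
d_in, d_out, rank = 8, 6, 2  # hypothetical sizes, chosen only for illustration
w = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]   # frozen weights
a = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]  # trainable
b = [[0.0] * d_out for _ in range(rank)]  # trainable; zero init makes the adapter a no-op at start
x = [[random.gauss(0, 1) for _ in range(d_in)]]

y = lora_forward(x, w, a, b, scale=1.0)
full, lora = trainable_params(d_in, d_out, rank)
```

With these toy sizes the adapter trains 28 parameters instead of 48. The saving grows with layer width, since the adapter cost scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.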

Code & Models

Repositories

HelloWorldLTY/Geneverse
PyTorch · Official

Videos

Geneverse: A Collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research

Taxonomy

Topics: Genetics, Bioinformatics, and Biomedical Research

Methods: Balanced Selection