Advancing bioinformatics with large language models: components, applications and perspectives

Jiajia Liu; Mengyuan Yang; Yankai Yu; Haixia Xu; Tiangang Wang; Kang Li; Xiaobo Zhou

arXiv:2401.04155·q-bio.QM·February 4, 2025·27 cites

Open Access

TL;DR

This review explores how large language models are transforming bioinformatics by detailing their components, applications across various biological data types, and offering practical guidance for their effective use and development.

Contribution

It provides a comprehensive overview of LLM components, applications in bioinformatics, and practical strategies, highlighting their potential beyond natural language processing.

Findings

01. LLMs are effective in genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis.

02. Current foundation models enable diverse bioinformatics applications.

03. Guidelines are provided for optimizing LLM use and fostering innovation.

Abstract

Large language models (LLMs) are a class of deep-learning-based artificial intelligence models that perform strongly on a wide range of tasks, especially in natural language processing (NLP). They typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we provide a comprehensive overview of the essential components of LLMs in bioinformatics, spanning genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Key aspects covered include tokenization methods for diverse data types, the architecture of transformer models, the core attention mechanism, and the pre-training…
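As a rough illustration of what "tokenization methods for diverse data types" can mean for nucleotide sequences, the sketch below splits a DNA string into overlapping k-mers and maps them to integer ids. This is an assumed, commonly used scheme chosen for illustration, not necessarily the one the review describes; the function names (`kmer_tokenize`, `build_vocab`, `encode`) and the `[UNK]` token are hypothetical.

```python
from __future__ import annotations

from itertools import product


def kmer_tokenize(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into k-mers, one every `stride` bases."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(k: int = 3, alphabet: str = "ACGT") -> dict[str, int]:
    """Map every possible k-mer over `alphabet` to an id; id 0 is '[UNK]'."""
    vocab = {"[UNK]": 0}
    kmers = ("".join(p) for p in product(alphabet, repeat=k))
    for idx, kmer in enumerate(kmers, start=1):
        vocab[kmer] = idx
    return vocab


def encode(seq: str, vocab: dict[str, int], k: int = 3, stride: int = 1) -> list[int]:
    """Convert a sequence to token ids; unseen k-mers (e.g. with 'N') map to 0."""
    unk = vocab["[UNK]"]
    return [vocab.get(tok, unk) for tok in kmer_tokenize(seq, k, stride)]


if __name__ == "__main__":
    vocab = build_vocab(k=3)
    print(kmer_tokenize("ACGTAC", k=3))   # overlapping 3-mers
    print(encode("ACGTAC", vocab, k=3))   # integer ids fed to an embedding layer
```

With `stride=1` the k-mers overlap, which preserves local context at the cost of longer token sequences; setting `stride=k` yields non-overlapping tokens instead.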

Taxonomy

Topics: RNA modifications and cancer · Machine Learning in Bioinformatics · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Linear Layer · Attention Dropout · Dropout · Adam · Layer Normalization · Residual Connection