BioBridge: Bridging Proteins and Language for Enhanced Biological Reasoning with LLMs
Yujia Wang, Jihong Guan, Wengen Li, Shuigeng Zhou, Xuhong Wang

TL;DR
BioBridge is a framework that adds protein domain knowledge to large language models through continual pretraining and cross-modal alignment, reaching performance comparable to specialized PLMs on protein benchmarks while staying on par with general LLMs on general understanding tasks.
Contribution
It introduces a domain-adaptive continual pretraining method and a cross-modal alignment pipeline to integrate protein knowledge into LLMs, enabling versatile biological reasoning.
Findings
Achieves comparable performance to specialized PLMs on protein benchmarks
Performs on par with LLMs on general understanding tasks
Effectively mitigates catastrophic forgetting during domain adaptation
Abstract
Existing Protein Language Models (PLMs) often suffer from limited adaptability to multiple tasks and exhibit poor generalization across diverse biological contexts. In contrast, general-purpose Large Language Models (LLMs) lack the capability to interpret protein sequences and fall short in domain-specific knowledge, limiting their capacity for effective biosemantic reasoning. To combine the advantages of both, we propose BioBridge, a domain-adaptive continual pretraining framework for protein understanding. This framework employs Domain-Incremental Continual Pre-training (DICP) to infuse protein domain knowledge and a general reasoning corpus into an LLM simultaneously, effectively mitigating catastrophic forgetting. Cross-modal alignment is achieved via a PLM-Projector-LLM pipeline, which maps protein sequence embeddings into the semantic space of the language model. Ultimately, an…
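To make the PLM-Projector-LLM idea concrete, here is a minimal sketch of what a projector step might look like. The abstract only names the pipeline, so everything below is an assumption for illustration: the two-layer MLP with GELU, the toy dimensions, the random weights, and the choice to prepend projected protein tokens to the text embeddings. The function names (`project`, `build_llm_input`) are hypothetical and do not come from the paper.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random small-valued matrix standing in for learned weights (assumption)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def project(protein_emb, w1, w2):
    """Map per-residue PLM embeddings to the LLM embedding width.

    Hypothetical projector: linear -> GELU -> linear. The paper's actual
    projector architecture is not given in the abstract.
    """
    hidden = [[gelu(v) for v in row] for row in matmul(protein_emb, w1)]
    return matmul(hidden, w2)


def build_llm_input(protein_emb, text_emb, w1, w2):
    """Prepend projected protein tokens to text token embeddings (assumed layout)."""
    return project(protein_emb, w1, w2) + text_emb


# Toy dimensions, chosen only for the example.
PLM_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12
rng = random.Random(0)

w1 = make_matrix(PLM_DIM, HIDDEN_DIM, rng)
w2 = make_matrix(HIDDEN_DIM, LLM_DIM, rng)

protein_emb = make_matrix(6, PLM_DIM, rng)  # 6 residues from a stand-in PLM
text_emb = make_matrix(4, LLM_DIM, rng)     # 4 text tokens from a stand-in LLM

llm_input = build_llm_input(protein_emb, text_emb, w1, w2)
print(len(llm_input), len(llm_input[0]))  # 10 12
```

In this sketch the language model would consume a single sequence of 10 vectors, all of width 12: six projected protein positions followed by four text positions. In a real system the projector weights would be trained so that the projected vectors are meaningful to the language model; the random weights here only demonstrate the shapes.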
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Rare Diseases · Biomedical Text Mining and Ontologies
