TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering
Yiqing Shen, Zan Chen, Michail Mamalakis, Yungeng Liu, Tianbin Li, Yanzhou Su, Junjun He, Pietro Liò, Yu Guang Wang

TL;DR
This paper introduces TourSynbio-7B, a multi-modal large language model that inherently understands proteins as language, enabling advanced protein engineering tasks without external encoders, and demonstrates superior performance over GPT-4 on relevant benchmarks.
Contribution
The paper presents the first multi-modal large model for protein engineering that learns protein understanding internally, reducing complexity and enhancing performance compared to previous methods.
Findings
TourSynbio-7B reaches 62.18% accuracy on ProteinLMBench, outperforming GPT-4.
TourSynbio-Agent enables versatile protein engineering tasks via a unified interface.
Wet lab case studies validate the model's practical effectiveness.
Abstract
The structural similarities between protein sequences and natural languages have led to parallel advancements in deep learning across both domains. While large language models (LLMs) have achieved much progress in the domain of natural language processing, their potential in protein engineering remains largely unexplored. Previous approaches have equipped LLMs with protein understanding capabilities by incorporating external protein encoders, but this fails to fully leverage the inherent similarities between protein sequences and natural languages, resulting in sub-optimal performance and increased model complexity. To address this gap, we present TourSynbio-7B, the first multi-modal large model specifically designed for protein engineering tasks without external protein encoders. TourSynbio-7B demonstrates that LLMs can inherently learn to understand proteins as language. The model is…
Taxonomy
Topics: Genetics, Bioinformatics, and Biomedical Research · Evolutionary Algorithms and Applications · Machine Learning in Bioinformatics
Methods: Linear Layer · Adam · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings
