Find Central Dogma Again: Leveraging Multilingual Transfer in Large Language Models
Wang Liang

TL;DR
This paper demonstrates that large language models can rediscover the genetic central dogma by leveraging multilingual transfer capabilities, reaching 81% accuracy in DNA-protein sequence alignment without prior biological knowledge.
Contribution
It introduces a novel approach of using multilingual transfer in LLMs to uncover fundamental biological laws, specifically the central dogma, through sequence alignment tasks.
Findings
Achieved 81% accuracy in DNA-protein sequence alignment.
Showed LLMs can rediscover biological principles without prior knowledge.
Analyzed factors influencing zero-shot learning capabilities.
Abstract
In recent years, large language models (LLMs) have achieved state-of-the-art results in various biological sequence analysis tasks, such as sequence classification, structure prediction, and function prediction. As with AI in other scientific fields, deeper research into biological LLMs has begun to focus on using these models to rediscover important existing biological laws or to uncover entirely new patterns in biological sequences. This study uses the language transfer capabilities of GPT-like LLMs to rediscover the genetic code rules of the central dogma. In our experimental design, we recast the central dogma as a binary classification problem of aligning DNA sequences with protein sequences, where positive examples are matching DNA and protein sequences and negative examples are non-matching pairs. We first trained a GPT-2 model from scratch using…
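The binary classification setup described in the abstract can be illustrated with a minimal sketch. The excerpt does not say how the authors built their pairs, so everything here is an assumption for illustration: the standard genetic code table, translation in reading frame 0 up to the first stop codon, and the hypothetical helper names `translate` and `make_pairs`. Negatives are drawn by pairing a DNA sequence with the protein of a different sequence.

```python
import random
from itertools import product

# Standard genetic code (assumed here; the excerpt does not name a table).
# Codons are enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def make_pairs(dna_seqs, seed=0):
    """Build (dna, protein, label) triples (hypothetical helper).

    label 1: the protein is the translation of the DNA (matching pair).
    label 0: the protein comes from a different sequence (non-matching pair).
    """
    rng = random.Random(seed)
    proteins = [translate(d) for d in dna_seqs]
    pairs = []
    for dna, protein in zip(dna_seqs, proteins):
        pairs.append((dna, protein, 1))
        # Only proteins that differ from the true translation are valid negatives.
        others = [p for p in proteins if p != protein]
        if others:
            pairs.append((dna, rng.choice(others), 0))
    return pairs


if __name__ == "__main__":
    for dna, protein, label in make_pairs(["ATGGCCAAATAA", "ATGTTTGGATGA"]):
        print(label, dna, protein)
```

Each triple would then be serialized into whatever input format the classifier expects; that step depends on the tokenizer and model and is not shown.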
Taxonomy
Topics: Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms · Genomics and Rare Diseases
Methods: Attention Is All You Need · Cosine Annealing · Dense Connections · Linear Layer · Multi-Head Attention · Adam · Softmax · Dropout · Weight Decay
