From Sentences to Sequences: Rethinking Languages in Biological System
Ke Liu, Shuaike Shen, Hao Chen

TL;DR
This paper explores how language modeling techniques from NLP can be adapted to biological sequences by considering the 3D structure of biomolecules, emphasizing the importance of structural evaluation.
Contribution
It redefines biological language modeling by integrating structural information, demonstrating the applicability of auto-regressive models to biological sequences.
Findings
Structural evaluation is important for assessing biological language models.
The auto-regressive paradigm is applicable to biological sequences.
Treating the 3D structure of a biomolecule as the semantic content of its sequence improves understanding of biomolecular languages.
Abstract
The paradigm of large language models in natural language processing (NLP) has also shown promise in modeling biological languages, including proteins, RNA, and DNA. Both the auto-regressive generation paradigm and evaluation metrics have been transferred from NLP to biological sequence modeling. However, the intrinsic structural correlations in natural and biological languages differ fundamentally. Therefore, we revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence and accounting for the strong correlations between residues or bases, we highlight the importance of structural evaluation and demonstrate the applicability of the auto-regressive paradigm in biological language modeling. Code can be found at…
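To make the auto-regressive paradigm and its NLP-style evaluation concrete, here is a minimal sketch that factorizes a protein sequence as a product of next-residue probabilities and scores it with perplexity. It is not the paper's model. A smoothed bigram model stands in for a large language model, and the function names, the start token, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
START = "^"  # hypothetical start-of-sequence token (an assumption)


def fit_bigram(sequences, alpha=1.0):
    """Estimate p(x_i | x_{i-1}) from counts with add-alpha smoothing.

    A bigram model is the simplest auto-regressive model: each residue
    is conditioned only on the one before it.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        prev = START
        for residue in seq:
            counts[prev][residue] += 1.0
            prev = residue

    def prob(prev, residue):
        total = sum(counts[prev].values()) + alpha * len(ALPHABET)
        return (counts[prev][residue] + alpha) / total

    return prob


def log_likelihood(prob, seq):
    """Sum of log p(x_i | x_{i-1}) over the sequence, left to right."""
    prev, total = START, 0.0
    for residue in seq:
        total += math.log(prob(prev, residue))
        prev = residue
    return total


def perplexity(prob, seq):
    """Exponentiated average negative log-likelihood per residue."""
    return math.exp(-log_likelihood(prob, seq) / len(seq))


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    model = fit_bigram(["MKVLA", "MKVLG"])
    print(perplexity(model, "MKVLA"))
```

Perplexity of this kind measures only how well the sequence statistics are predicted. It says nothing about whether a generated sequence folds into a plausible 3D structure, which is the gap the abstract's call for structural evaluation addresses.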
