Linguistically inspired roadmap for building biologically reliable protein language models
Mai Ha Vu, Rahmad Akbar, Philippe A. Robert, Bartlomiej Swiatczak, Victor Greiff, Geir Kjetil Sandve, Dag Trygve Truslew Haug

TL;DR
This paper proposes a linguistically inspired framework for developing interpretable protein language models, aiming to enhance understanding of sequence-function relationships in proteins for better biotherapeutic development.
Contribution
It introduces a novel roadmap integrating linguistic principles into protein language model design, addressing interpretability and domain-specific knowledge incorporation.
Findings
Guidance drawn from linguistics can help make protein LMs more interpretable.
The roadmap gives guidelines for training data, tokenization, and embedding choices in protein LMs.
More interpretable protein LMs have the potential to uncover biological mechanisms underlying sequence-function links.
Abstract
Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared to natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training…
Taxonomy
Topics: Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms · Computational Drug Discovery Methods
