Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling
Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa, Mohamed Elkerdawy, Charlotte Rochereau, and Burkhard Rost

TL;DR
Ankh is a cost-effective, protein-specific language model that outperforms larger models in understanding protein structure and function, emphasizing accessibility and data efficiency.
Contribution
We introduce Ankh, a novel general-purpose protein language model trained efficiently with fewer resources; it achieves state-of-the-art results and yields insights into protein evolution and diversity.
Findings
Ankh surpasses state-of-the-art performance with fewer parameters.
It effectively learns protein conservation-mutation trends.
It maintains structural and functional integrity in generated variants.
Abstract
Rather than scaling up protein language models (PLMs), we seek to improve performance via protein-specific optimization. Although the proportionality between language model size and the richness of its learned representations is validated, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through over twenty experiments spanning masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that interprets the language of life optimally. We present Ankh, the first general-purpose PLM trained on Google's TPU-v4, surpassing state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% for the embedding dimension). We provide a representative range of structure and function benchmarks where Ankh excels. We…
Code & Models
- 🤗 Synthyra/ANKH_large · model · 412 dl · ♡ 1
- 🤗 Synthyra/ANKH_base · model · 610 dl
- 🤗 ElnaggarLab/ankh2-large · model · 17 dl · ♡ 3
- 🤗 ElnaggarLab/ankh2-ext1 · model · 7 dl · ♡ 1
- 🤗 ElnaggarLab/ankh2-ext2 · model · 150 dl · ♡ 1
- 🤗 Synthyra/ANKH2_large · model · 391 dl · ♡ 1
- 🤗 Synthyra/ANKH3_large · model · 327 dl
- 🤗 Synthyra/ANKH3_xl · model · 162 dl
Taxonomy
Topics: Machine Learning in Bioinformatics · Software Engineering Research · Machine Learning in Materials Science
