Scaling Up ESM2 Architectures for Long Protein Sequences Analysis: Long and Quantized Approaches

Gabriel Bianchin de Oliveira; Helio Pedrini; Zanoni Dias

arXiv:2501.07747·cs.LG·January 15, 2025

TL;DR

This paper introduces long and quantized versions of the ESM2 architectures that raise the input size limit for protein sequence analysis from 1,022 to 2,048 amino acids, roughly doubling the sequence length the models can process directly.

Contribution

The paper presents long and quantized ESM2 architectures with a larger input size limit, so that protein sequences of up to 2,048 amino acids can be analyzed without the preprocessing previously needed for sequences longer than 1,022 amino acids.

Findings

01. Doubling input size limit to 2,048 amino acids.

02. Improved ability to analyze longer protein sequences.

03. Enhanced performance over previous ESM2 models.

Abstract

Various approaches utilizing Transformer architectures have achieved state-of-the-art results in Natural Language Processing (NLP). Based on this success, numerous architectures have been proposed for other types of data, such as in biology, particularly for protein sequences. Notably among these are the ESM2 architectures, pre-trained on billions of proteins, which form the basis of various state-of-the-art approaches in the field. However, the ESM2 architectures have a limitation regarding input size, restricting it to 1,022 amino acids, which necessitates the use of preprocessing techniques to handle sequences longer than this limit. In this paper, we present the long and quantized versions of the ESM2 architectures, doubling the input size limit to 2,048 amino acids.
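
To illustrate the kind of preprocessing the abstract refers to, here is a minimal sketch of one common option: splitting a long sequence into windows that each fit the input limit. The function name `split_into_windows` and the `overlap` parameter are hypothetical, and this is not necessarily the technique the authors compare against; it only shows why a larger limit reduces the need for such splitting.

```python
from typing import List


def split_into_windows(sequence: str, max_len: int = 1022, overlap: int = 0) -> List[str]:
    """Split an amino-acid sequence into windows of at most max_len residues.

    Hypothetical helper for illustration only. max_len=1022 mirrors the
    original ESM2 input limit stated in the abstract; passing 2048 mirrors
    the limit of the long versions.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    # Sequences within the limit need no preprocessing at all.
    if len(sequence) <= max_len:
        return [sequence]

    step = max_len - overlap  # distance between consecutive window starts
    windows = []
    for start in range(0, len(sequence), step):
        windows.append(sequence[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(sequence):
            break
    return windows


# Toy sequence of 1,500 residues (alanine repeated; not a real protein).
toy_sequence = "A" * 1500
windows_short_limit = split_into_windows(toy_sequence, max_len=1022)
windows_long_limit = split_into_windows(toy_sequence, max_len=2048)
```

With the 1,022-residue limit the toy sequence has to be split into two windows, whereas with the 2,048-residue limit it fits in a single window and can be passed to the model whole.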

Taxonomy

Methods: Attention Is All You Need · Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer