M5: A Whole Genome Bacterial Encoder at Single Nucleotide Resolution
Agust Egilsson

TL;DR
This paper introduces M5, an encoder-only transformer with linear attention for bacterial genomes. It processes entire genomes at single nucleotide resolution, its performance improves as the input genome length increases, and its approximation of full attention remains stable at long sequence lengths.
Contribution
The paper presents M5, a new genome encoder with linear attention that extends transformer context length to millions of nucleotides, enabling whole genome modeling at single nucleotide resolution.
Findings
Notable performance improvements with increasing genome length.
Stable approximation of full attention at long sequence lengths.
Efficient training on a single GPU for large genomic sequences.
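To see why fitting these sequence lengths on a single GPU depends on avoiding quadratic attention, the back-of-the-envelope sketch below computes the size of one explicit n × n attention score matrix. The 2-byte (fp16) entry size, the single head, the single layer, and reading "196K" as 196,000 are illustrative assumptions, not figures from the paper; the helper name `attention_matrix_bytes` is hypothetical.

```python
def attention_matrix_bytes(n_tokens, bytes_per_entry=2):
    """Bytes needed to store one dense n x n attention score matrix.

    Assumes one head, one layer, and fp16 entries (2 bytes each).
    """
    return n_tokens * n_tokens * bytes_per_entry


# Sequence lengths taken from the abstract (196K read as 196,000 here).
for n in (196_000, 2_000_000):
    gigabytes = attention_matrix_bytes(n) / 1e9
    print(f"{n:>9,} nucleotides: {gigabytes:,.1f} GB per head")
```

Under these assumptions a single dense score matrix already needs about 76.8 GB at 196K nucleotides and about 8 TB at 2M nucleotides, both above the 40 GB available on the GPU mentioned in the abstract.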
Abstract
This report describes a linear attention mechanism that extends the context length of an encoder-only transformer, called M5, to a multi-million-nucleotide foundation model at single nucleotide resolution, pretrained on bacterial whole genomes. The linear attention mechanism closely approximates full quadratic attention and has a simple, lightweight implementation when the key-query embedding dimensionality is low. The M5-small model is trained and tested entirely on one A100 GPU with 40 GB of memory, on sequences of up to 196K nucleotides during training and 2M nucleotides during testing. We test the M5-small model and record notable improvements in performance as whole genome bacterial sequence lengths increase, and we demonstrate that the approximation of full multi-head attention remains stable as sequence length increases.
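The abstract does not spell out the mechanism, so the following is only a generic sketch of how linear attention can approximate softmax attention when the key-query dimension is low. It swaps in a second-order Taylor feature map, which is an assumption chosen for illustration and not necessarily the approximation used in M5. The function names (`taylor_features`, `linear_attention`, `softmax_attention`) are hypothetical. The idea: if phi(q) · phi(k) = 1 + q·k + (q·k)²/2 ≈ exp(q·k), the sums over keys can be computed once and reused for every query, so cost grows linearly with sequence length, while the feature size 1 + d + d² stays small only when d is small.

```python
import numpy as np


def taylor_features(x):
    """Map (n, d) vectors to (n, 1 + d + d*d) features such that
    phi(q) @ phi(k) == 1 + q @ k + (q @ k) ** 2 / 2."""
    n, d = x.shape
    ones = np.ones((n, 1))
    # Flattened outer product x x^T, scaled so the dot product gives (q.k)^2 / 2.
    second = np.einsum("ni,nj->nij", x, x).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([ones, x, second], axis=1)


def linear_attention(q, k, v):
    """Non-causal (encoder-style) attention, linear in sequence length."""
    d = q.shape[1]
    scale = d ** -0.25  # splits the usual 1/sqrt(d) between queries and keys
    fq = taylor_features(q * scale)
    fk = taylor_features(k * scale)
    kv = fk.T @ v          # (features, d_v): summed once over all positions
    z = fk.sum(axis=0)     # (features,): normaliser shared by every query
    return (fq @ kv) / (fq @ z)[:, None]


def softmax_attention(q, k, v):
    """Reference quadratic attention, usable only for short sequences."""
    scores = q @ k.T / np.sqrt(q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v
```

On short inputs with small query-key dot products, the two functions should agree closely, which gives a simple way to check the approximation before scaling up. The truncated series drifts from the exponential as the dot products grow, so a real model would need some way to keep them bounded; how M5 handles this is not stated in the abstract.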
Taxonomy
Topics: Genomics and Phylogenetic Studies · Bacteriophages and microbial interactions · Microbial Community Ecology and Physiology
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention
