Unlocking Efficiency: Adaptive Masking for Gene Transformer Models

Soumyadeep Roy; Shamik Sural; Niloy Ganguly

arXiv:2408.07180·cs.CL·October 23, 2024

TL;DR

This paper introduces a curriculum masking strategy for gene transformer models that improves training efficiency and representation quality, enabling comparable performance with fewer training steps.

Contribution

The paper proposes CM-GEMS, a curriculum masking approach that uses a pointwise mutual information-based difficulty criterion to improve the training efficiency and downstream task performance of gene transformer models.
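
As a rough illustration of what a curriculum masking schedule can look like, the sketch below moves the pool of maskable positions from easy to hard tokens as training progresses. It is not the CM-GEMS implementation: the function name `curriculum_mask`, the linear schedule, the half-sequence pool size, and the assumption that a higher score means a harder token are all hypothetical choices made for this example.

```python
import random

def curriculum_mask(scores, step, total_steps, mask_rate=0.15, seed=0):
    """Pick positions to mask; the candidate pool slides from easy to hard tokens.

    scores: one difficulty score per token (assumption: higher = harder).
    """
    n = len(scores)
    n_mask = max(1, round(mask_rate * n))
    # Rank positions from easiest (lowest score) to hardest (highest score).
    ranked = sorted(range(n), key=lambda i: scores[i])
    # Training progress clipped to [0, 1] (hypothetical linear schedule).
    progress = min(max(step / total_steps, 0.0), 1.0)
    # The pool covers half of the sequence and shifts toward the hard end.
    pool_size = max(n_mask, n // 2)
    start = round(progress * (n - pool_size))
    pool = ranked[start:start + pool_size]
    # Sample the masked positions uniformly from the current pool.
    rng = random.Random(seed + step)
    return sorted(rng.sample(pool, n_mask))

# Toy example: 20 tokens whose difficulty grows with position.
toy_scores = [float(i) for i in range(20)]
early = curriculum_mask(toy_scores, step=0, total_steps=1000)
late = curriculum_mask(toy_scores, step=1000, total_steps=1000)
```

With these toy scores, `early` draws only from the ten easiest positions and `late` only from the ten hardest, which is the easy-to-hard progression that a curriculum schedule is meant to produce.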

Findings

01. CM-GEMS outperforms baseline masking methods in gene classification tasks.

02. Models trained with CM-GEMS reach similar accuracy in fewer steps.

03. Curriculum learning significantly reduces training time for gene transformers.

Abstract

Gene transformer models such as Nucleotide Transformer, DNABERT, and LOGO are trained to learn optimal gene sequence representations by using the Masked Language Modeling (MLM) training objective over the complete Human Reference Genome. However, the typical tokenization methods employ a basic sliding window of tokens, such as k-mers, that fails to utilize gene-centric semantics. This could result in the (trivial) masking of easily predictable sequences, leading to inefficient MLM training. Time-variant training strategies are known to improve pretraining efficiency in both language and vision tasks. In this work, we focus on using curriculum masking, where we systematically increase the difficulty of the masked token prediction task by using a Pointwise Mutual Information-based difficulty criterion, as gene sequences lack well-defined semantic units similar to the words or sentences of NLP…
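
To make the Pointwise Mutual Information criterion concrete, here is a minimal sketch that scores pairs of k-mers using the standard definition PMI(x, y) = log(p(x, y) / (p(x) p(y))). The helper names `kmers` and `pmi_scores`, and the choice to pair each k-mer with the non-overlapping k-mer that follows it, are assumptions made for this illustration; the abstract does not specify how the paper computes its scores.

```python
import math
from collections import Counter

def kmers(seq, k):
    """Tokenize a sequence into overlapping k-mers with a stride-1 sliding window."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def pmi_scores(seq, k):
    """Estimate PMI for each (k-mer, following non-overlapping k-mer) pair in seq."""
    tokens = kmers(seq, k)
    # Pair token i with token i + k so the two k-mers share no nucleotides
    # (an assumption of this sketch, to avoid trivially correlated overlaps).
    pairs = list(zip(tokens, tokens[k:]))
    unigram = Counter(tokens)
    bigram = Counter(pairs)
    scores = {}
    for (x, y), count in bigram.items():
        p_xy = count / len(pairs)
        p_x = unigram[x] / len(tokens)
        p_y = unigram[y] / len(tokens)
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

# Toy example on a short, made-up sequence.
toy = pmi_scores("ACGTACGTTTGACGT", k=3)
```

A high score means the two k-mers co-occur more often than their individual frequencies would suggest. Whether such pairs count as easy or hard to predict depends on how masking is applied, so the mapping from PMI to difficulty is left open here.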

Code & Models

Repositories

roysoumya/curriculum-genemask
PyTorch · Official

Taxonomy

Topics: Evolutionary Algorithms and Applications · Machine Learning and Data Classification · Gene Regulatory Network Analysis

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections