Structure-Aware Masking for Protein Representation Learning

Thomas Walton; Ayan Goel; Amirali Aghazadeh

arXiv:2605.16581 · cs.LG · May 19, 2026

TL;DR

This paper proposes Bucket Masking, a structure-aware masking strategy for protein language models that improves the modeling of long-range interactions by selecting residue groups based on 3D proximity, leading to better fitness prediction.

Contribution

Introduction of Bucket Masking, a novel structure-aware masking method that enhances protein representation learning by focusing on 3D structural dependencies.

Findings

01 Up to 14% improvement in protein fitness prediction tasks.

02 Better modeling of long-range residue interactions.

03 Mask placement, not span size, drives improvements.

Abstract

Masked language modeling (MLM) is the standard objective for training protein language models, typically implemented by randomly masking individual residues at a fixed rate (e.g., 15%). This practice implicitly assumes that all sequence positions contribute equally to representation learning. In downstream fitness prediction tasks, however, protein sequences are governed by three-dimensional structural dependencies and long-range residue contacts that induce strong nonlocal couplings between residues. We introduce Bucket Masking, a structure-aware masking strategy that selects groups of residues based on their proximity in three-dimensional space, preferentially masking structurally coupled regions during training. By conditioning the masking distribution on residue contacts, Bucket Masking shifts the learning objective toward modeling long-range interactions that are critical for…
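
To make the contrast in the abstract concrete, the sketch below compares uniform random masking with one possible proximity-based scheme. The abstract does not spell out how residue groups are formed, so everything here is an assumption for illustration rather than the authors' method: the function names `random_mask` and `proximity_mask`, the `bucket_size` parameter, the use of one 3D coordinate per residue (for example, C-alpha atoms), and the nearest-neighbor grouping rule are all hypothetical.

```python
import numpy as np


def random_mask(n_residues, mask_rate=0.15, rng=None):
    """Baseline MLM masking: pick individual positions uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    n_mask = max(1, int(round(mask_rate * n_residues)))
    mask = np.zeros(n_residues, dtype=bool)
    mask[rng.choice(n_residues, size=n_mask, replace=False)] = True
    return mask


def proximity_mask(coords, mask_rate=0.15, bucket_size=5, rng=None):
    """Hypothetical structure-aware masking (an assumption, not the paper's rule).

    Repeatedly picks a random unmasked seed residue and masks it together
    with its nearest unmasked neighbors in 3D, until the mask budget is used.
    coords: array of shape (n_residues, 3), one point per residue.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = coords.shape[0]
    n_mask = max(1, int(round(mask_rate * n)))
    # Pairwise Euclidean distances between residues, shape (n, n).
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    mask = np.zeros(n, dtype=bool)
    while mask.sum() < n_mask:
        unmasked = np.flatnonzero(~mask)
        seed = rng.choice(unmasked)
        # Sort unmasked residues by distance to the seed (seed itself is first).
        order = unmasked[np.argsort(dist[seed, unmasked])]
        take = min(bucket_size, n_mask - int(mask.sum()))
        mask[order[:take]] = True
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_coords = rng.normal(size=(100, 3)) * 10.0  # synthetic stand-in for a structure
    print(np.flatnonzero(random_mask(100, rng=rng)))
    print(np.flatnonzero(proximity_mask(toy_coords, rng=rng)))
```

Both functions mask the same number of positions for a given rate, so any difference in downstream behavior under this sketch would come from where the masks fall rather than how many tokens are hidden, which is in line with the page's stated finding that mask placement, not span size, drives the improvements.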
