Structure-Aware Masking for Protein Representation Learning
Thomas Walton, Ayan Goel, Amirali Aghazadeh

TL;DR
This paper proposes Bucket Masking, a structure-aware masking strategy for protein language models that improves the modeling of long-range interactions by selecting residue groups based on 3D proximity, leading to better fitness prediction.
Contribution
Introduces Bucket Masking, a structure-aware masking method for protein representation learning that concentrates masks on residues coupled through 3D structure rather than on uniformly random positions.
Findings
Up to 14% improvement in protein fitness prediction tasks.
Better modeling of long-range residue interactions.
Mask placement, not span size, drives improvements.
Abstract
Masked language modeling (MLM) is the standard objective for training protein language models, typically implemented by randomly masking individual residues at a fixed rate (e.g., 15%). This practice implicitly assumes that all sequence positions contribute equally to representation learning. In downstream fitness prediction tasks, however, protein sequences are governed by three-dimensional structural dependencies and long-range residue contacts that induce strong nonlocal couplings between residues. We introduce Bucket Masking, a structure-aware masking strategy that selects groups of residues based on their proximity in three-dimensional space, preferentially masking structurally coupled regions during training. By conditioning the masking distribution on residue contacts, Bucket Masking shifts the learning objective toward modeling long-range interactions that are critical for…
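To make the contrast in the abstract concrete, the sketch below compares uniform random masking with one possible proximity-based scheme. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how groups are formed, so the seed-and-nearest-neighbours rule, the function names, the `group_size` parameter, and the use of one 3D coordinate per residue (for example C-alpha positions) are all hypothetical.

```python
import numpy as np


def random_mask(length, mask_rate=0.15, rng=None):
    """Baseline: mask a fixed fraction of positions uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    n_mask = max(1, int(round(mask_rate * length)))
    mask = np.zeros(length, dtype=bool)
    mask[rng.choice(length, size=n_mask, replace=False)] = True
    return mask


def proximity_mask(coords, mask_rate=0.15, group_size=5, rng=None):
    """Hypothetical structure-aware masking.

    Repeatedly picks an unmasked seed residue and masks it together with its
    nearest unmasked neighbours in 3D space, until the mask budget is spent.
    `coords` is an (L, 3) array with one coordinate per residue (assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    coords = np.asarray(coords, dtype=float)
    length = coords.shape[0]
    n_mask = max(1, int(round(mask_rate * length)))
    mask = np.zeros(length, dtype=bool)

    while mask.sum() < n_mask:
        # Choose a seed among residues that are not masked yet.
        seed = rng.choice(np.flatnonzero(~mask))
        # Euclidean distance from the seed to every residue.
        dist = np.linalg.norm(coords - coords[seed], axis=1)
        dist[mask] = np.inf  # never re-select already masked residues
        # Do not exceed the overall mask budget with the last group.
        k = min(group_size, n_mask - int(mask.sum()))
        mask[np.argsort(dist)[:k]] = True  # seed itself has distance 0
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "structure": a 3D random walk standing in for real coordinates.
    toy_coords = np.cumsum(rng.normal(size=(100, 3)), axis=0)
    print("random   :", np.flatnonzero(random_mask(100, rng=rng)))
    print("proximity:", np.flatnonzero(proximity_mask(toy_coords, rng=rng)))
```

Both functions mask the same number of residues, so any difference between them comes from where the masks fall, which is the variable the summary's findings single out. Because the groups are chosen by spatial distance, a masked group may include residues far apart in sequence but close in the fold.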
