Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models
Gal Kesten-Pomeranz, Yaniv Nikankin, Anja Reusch, Tomer Tsaban, Ora Schueler-Furman, Yonatan Belinkov

TL;DR
This paper investigates how protein language models (PLMs) detect repeats in protein sequences. It reveals a two-stage mechanism: the model first builds feature representations, then attends to aligned tokens across repeated segments, combining language-style pattern matching with biological knowledge.
Contribution
The paper uncovers the internal mechanisms by which PLMs detect protein repeats and shows how they combine pattern matching with biological features, which advances understanding of how these models use biological knowledge.
Findings
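PLMs build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity.
The mechanism for detecting approximate repeats functionally subsumes the mechanism for exact repeats.
Induction heads attend to aligned tokens across repeated segments to identify repeats.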
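To make "aligned tokens across repeats" concrete, here is a minimal Python sketch. It is a toy illustration and not the paper's method or a model of any PLM: the names `MASK`, `best_offset`, and `fill_mask_by_copy` and the example sequences are all hypothetical. The code estimates the shift between two copies of a segment by counting matching residues, then fills a masked position by copying the residue at the aligned position in the other copy. Because the shift is chosen by match count, the same routine handles an exact repeat and a repeat with a substitution.

```python
MASK = "?"  # hypothetical placeholder for a masked residue


def best_offset(seq):
    """Return the shift d that maximizes matches between seq[i] and seq[i - d]."""
    best_d, best_matches = None, -1
    for d in range(1, len(seq)):
        matches = sum(
            1
            for i in range(d, len(seq))
            if MASK not in (seq[i], seq[i - d]) and seq[i] == seq[i - d]
        )
        if matches > best_matches:  # ties keep the smallest shift
            best_d, best_matches = d, matches
    return best_d


def fill_mask_by_copy(seq):
    """Predict the masked residue by copying from the aligned position."""
    d = best_offset(seq)
    i = seq.index(MASK)
    if i - d >= 0:
        j = i - d  # aligned residue in the earlier copy
    elif i + d < len(seq):
        j = i + d  # aligned residue in the later copy
    else:
        raise ValueError("no aligned position inside the sequence")
    return seq[j], d


exact = "MKTAYIAKQR" + "MKTA?IAKQR"        # second copy identical, one mask
approximate = "MKTAYIAKQR" + "MKSA?IAKQR"  # second copy has a T->S substitution
print(fill_mask_by_copy(exact))
print(fill_mask_by_copy(approximate))
```

In both toy inputs the estimated shift is 10 and the copied residue is `Y`, which illustrates in a very simplified way why a mechanism that tolerates mismatches can also cover exact repeats.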
Abstract
Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work examining the behavior of protein language models (PLMs) in masked-token prediction has shown that they identify repeats. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments,…
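The two stages described in the abstract can be sketched loosely in Python. This is a hedged toy, not the model's actual computation: the coarse residue grouping in `GROUPS`, the `similarity` scores, the context `window`, and the `temperature` are all illustrative assumptions. Stage one maps residues to a similarity-aware feature; stage two computes softmax attention from a masked position to every other position, scoring each candidate by how similar its neighbors are to the neighbors of the masked position.

```python
import math

# Coarse physicochemical grouping of residues: an illustrative assumption,
# not the features the paper identifies.
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ", "CGP"]
CLASS_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def similarity(a, b):
    """Stage 1 (toy): 1.0 for identical residues, 0.5 for same group, else 0."""
    if a == b:
        return 1.0
    if a in CLASS_OF and CLASS_OF[a] == CLASS_OF.get(b):
        return 0.5
    return 0.0


def attention_from(seq, query_pos, window=2, temperature=0.5):
    """Stage 2 (toy): softmax attention weights from query_pos to all positions."""
    scores = []
    for j in range(len(seq)):
        if j == query_pos:
            scores.append(-math.inf)  # never attend to the masked position itself
            continue
        s = 0.0
        for k in range(-window, window + 1):
            if k == 0:
                continue
            qi, kj = query_pos + k, j + k
            if 0 <= qi < len(seq) and 0 <= kj < len(seq):
                s += similarity(seq[qi], seq[kj])  # compare surrounding context
        scores.append(s / temperature)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


seq = "MKTAYIAKQR" + "MKSA?IAKQR"  # approximate repeat, mask at index 14
weights = attention_from(seq, query_pos=14)
target = max(range(len(seq)), key=weights.__getitem__)
print(target, seq[target])
```

On this toy sequence the largest weight falls on index 4, the residue aligned with the mask in the first copy, even though the neighboring `T` was mutated to `S`. The partial credit for same-group residues is what lets the context still match.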
Taxonomy
Topics: Genomics and Rare Diseases · Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics
