Mechanistic Interpretability of Antibody Language Models Using SAEs
Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane

TL;DR
This paper explores the use of sparse autoencoders to interpret and steer antibody language models, revealing trade-offs between different SAE methods in biological feature identification and control.
Contribution
It applies and compares TopK and Ordered SAEs for mechanistic interpretability and steering of autoregressive antibody language models, showing where each method is preferable.
Findings
TopK SAEs reveal biologically meaningful features
High feature-concept correlation does not imply causal control
Ordered SAEs reliably identify steerable features
Abstract
Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models and to steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
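For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of what a TopK SAE encoding and a simple steering intervention can look like. It is not the authors' implementation: the weights are random rather than trained, the dimensions and the names `topk_sae_encode`, `sae_decode`, and `steer` are hypothetical, and steering by adding a scaled decoder direction to the activation is one common convention assumed here for illustration. Ordered SAEs are not sketched, since the abstract does not describe their construction.

```python
import numpy as np

def topk_sae_encode(h, W_enc, b_enc, k):
    """Sparse latents: ReLU pre-activations with all but the k largest zeroed."""
    pre = np.maximum(h @ W_enc + b_enc, 0.0)
    z = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]  # indices of the k largest activations
    z[top] = pre[top]
    return z

def sae_decode(z, W_dec, b_dec):
    """Reconstruct the model activation from the sparse latents."""
    return z @ W_dec + b_dec

def steer(h, W_dec, feature, alpha):
    """Assumed steering rule: add alpha times the unit-norm decoder
    direction of one latent feature to the activation."""
    direction = W_dec[feature] / np.linalg.norm(W_dec[feature])
    return h + alpha * direction

# Illustrative sizes and random (untrained) weights.
rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
b_dec = np.zeros(d_model)

h = rng.normal(size=d_model)              # stand-in for a hidden activation
z = topk_sae_encode(h, W_enc, b_enc, k)   # at most k non-zero latents
h_hat = sae_decode(z, W_dec, b_dec)       # reconstruction of h
h_steered = steer(h, W_dec, feature=3, alpha=2.0)
```

In this framing, a latent that correlates strongly with a concept at encoding time need not change generation when its decoder direction is added back, which is the gap between correlation and causal control that the abstract highlights.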
