Mechanistic Interpretability of Antibody Language Models Using SAEs

Rebonto Haque; Oliver M. Turnbull; Anisha Parsan; Nithin Parsan; John J. Yang; Anna L. Beukenhorst; Charlotte M. Deane

arXiv:2512.05794 · cs.LG · April 27, 2026

TL;DR

This paper uses sparse autoencoders (SAEs) to interpret and steer autoregressive antibody language models and finds a trade-off between methods: TopK SAEs yield more interpretable features, while Ordered SAEs yield features that steer generation more reliably.

Contribution

It applies and compares TopK and Ordered SAEs for mechanistic interpretability and generative steering of autoregressive antibody language models, and indicates when each is preferable.

Findings

01 TopK SAEs reveal biologically meaningful features

02 High feature-concept correlation does not imply causal control

03 Ordered SAEs reliably identify steerable features

Abstract

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models and to steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
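
To make the two operations named in the abstract concrete, the following is a minimal NumPy sketch of a generic TopK SAE forward pass and of steering by clamping one latent. It is an illustration under stated assumptions, not the authors' implementation: the weights are random rather than trained, the dimensions, the ReLU on the kept activations, and the function names (`topk_sae_encode`, `sae_decode`, `steer`) are all hypothetical, and the Ordered SAE variant is not shown.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k):
    """Encode activations, keeping only the k largest pre-activations per row."""
    pre = x @ W_enc + b_enc
    # Indices of the k largest entries in each row (unordered within the top k).
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    vals = np.take_along_axis(pre, idx, axis=-1)
    z = np.zeros_like(pre)
    # Assumption: negative values among the top k are clipped to zero.
    np.put_along_axis(z, idx, np.maximum(vals, 0.0), axis=-1)
    return z

def sae_decode(z, W_dec, b_dec):
    """Map sparse latents back to the model's activation space."""
    return z @ W_dec + b_dec

def steer(x, W_enc, b_enc, W_dec, b_dec, k, feature, value):
    """Clamp one latent to `value` and return the edited activations.

    The SAE's reconstruction error is added back, so the edit moves x only
    along the decoder direction of the chosen feature.
    """
    z = topk_sae_encode(x, W_enc, b_enc, k)
    z_steered = z.copy()
    z_steered[..., feature] = value
    recon = sae_decode(z, W_dec, b_dec)
    return x + sae_decode(z_steered, W_dec, b_dec) - recon

# Toy setup with random (untrained) weights; all sizes are hypothetical.
rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(3, d_model))   # stand-in for per-residue hidden states
z = topk_sae_encode(x, W_enc, b_enc, k)
x_steered = steer(x, W_enc, b_enc, W_dec, b_dec, k, feature=7, value=5.0)
```

In a real pipeline, `x_steered` would replace the hidden state at the hooked layer before generation continues. Whether the generated sequence then changes in the intended way is the causal question the abstract separates from feature-concept correlation.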
