Reverse Distillation: Consistently Scaling Protein Language Model Representations
Darius Catrina, Christian Bepler, Samuel Sledzieski, Rohit Singh

TL;DR
This paper introduces Reverse Distillation, a novel framework for improving protein language models by decomposing large model representations into shared and unique features, leading to consistent performance gains across various tasks.
Contribution
The paper proposes Reverse Distillation, a method that orthogonally decomposes a large protein language model's representations, using smaller models of the same family as a guide, so that larger reverse-distilled models consistently outperform smaller ones.
Findings
Reverse distillation improves model performance at the same embedding size.
The 15-billion-parameter reverse-distilled model achieves top results on ProteinGym.
The framework is applicable to other model families with scaling issues.
Abstract
Unlike the predictable scaling laws in natural language processing and computer vision, protein language models (PLMs) scale poorly: for many tasks, models within the same family plateau or even decrease in performance, with mid-sized models often outperforming the largest in the family. We introduce Reverse Distillation, a principled framework that decomposes large PLM representations into orthogonal subspaces guided by smaller models of the same family. The resulting embeddings have a nested, Matryoshka-style structure: the first k dimensions of a larger model's embedding are exactly the representation from the smaller model. This ensures that larger reverse-distilled models consistently outperform smaller ones. A motivating intuition is that smaller models, constrained by capacity, preferentially encode broadly-shared protein features. Reverse distillation isolates these shared…
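The nested structure described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a purely linear decomposition (least-squares regression of the large model's embeddings onto the small model's, followed by PCA on the residual), and the function names `fit_reverse_distillation` and `reverse_distilled_embedding`, as well as the parameter `k_extra`, are hypothetical. It only shows how the first k dimensions can be exactly the smaller model's representation while the remaining dimensions are orthogonal to it on the fitting data.

```python
import numpy as np


def fit_reverse_distillation(small, large, k_extra):
    """Fit a linear, orthogonal split of `large` guided by `small`.

    small: (n, d_small) embeddings from the smaller model.
    large: (n, d_large) embeddings from the larger model, same n proteins.
    k_extra: number of residual ("unique") dimensions to keep.
    Assumption: a linear map plus PCA; the paper's exact procedure may differ.
    """
    small_mean = small.mean(axis=0)
    large_mean = large.mean(axis=0)
    S = small - small_mean
    L = large - large_mean

    # Part of the large embedding that is linearly predictable from the small one.
    W, *_ = np.linalg.lstsq(S, L, rcond=None)

    # What the small model cannot explain; orthogonal to S on the fitting data.
    residual = L - S @ W

    # Top principal directions of the residual (rows of Vt).
    _, _, Vt = np.linalg.svd(residual, full_matrices=False)
    components = Vt[:k_extra]

    return {
        "small_mean": small_mean,
        "large_mean": large_mean,
        "W": W,
        "components": components,
    }


def reverse_distilled_embedding(params, small, large):
    """Return [small | unique] so the leading block is the small embedding."""
    S = small - params["small_mean"]
    L = large - params["large_mean"]
    residual = L - S @ params["W"]
    unique = residual @ params["components"].T
    return np.concatenate([small, unique], axis=1)
```

Under these assumptions, truncating the output to its first `d_small` columns recovers the smaller model's embedding unchanged, which is the Matryoshka-style property the abstract refers to.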
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper establishes the central problem: PLMs "scale relatively poorly", with the ESM-2 family's performance plateauing. The authors' core hypothesis, that this is due to larger models "entangling" universal (low-level) and specialized (high-level) features, which increases variance, is highly intuitive. 2. This work introduces high-performing and efficient embeddings. The resulting models outperform baselines at the same size. They also feature a "Matryoshka-style" structure, which allows
1. The authors should at least attempt a reverse distillation of 3B $\rightarrow$ 15B (or the full chain up to 15B). This experiment is critical. If rd.15B outperforms rd.3B, the paper's core thesis is validated. If rd.15B still underperforms, it would suggest the scaling problem is more complex than just feature entanglement, fundamentally weakening the paper's conclusion. 2. This linear-only approach may be restrictive. The paper itself hypothesizes that larger models encode rarer, higher-orde
This paper is quite strong in my opinion. The proposed methods are a clear improvement on current approaches and a valuable contribution to the protein representation field.
* Additional inference time, though not prohibitive, could still limit adoption.
1. The paper tackles a critical, well-documented problem (PLM scaling failure) with a highly novel solution. The idea of using smaller models as a basis for post-hoc orthogonal decomposition is elegant and new. 2. The experiments persuasively demonstrate that RD works. It not only improves baseline performance (e.g., rd.3B > 3B) but, more importantly, it restores monotonic scaling (rd.3B > rd.650M wins 96.4% of the time, vs. 53.6% for the baseline). 3. The BioMap experiment (Table 4) provides s
The paper's primary weakness is the exclusion of the ESM-2 15B model. The most severe example of scaling failure is the performance degradation from 3B to 15B. The paper only demonstrates fixing the 650M-to-3B plateau. Without testing the 15B model, the central claim of "solving" the scaling paradox is incomplete. The core idea of using orthogonal subspaces to separate/disentangle knowledge, while novel in this application, is conceptually similar to methods in continual learning (e.g., O-LoRA
Taxonomy
Topics: Machine Learning in Bioinformatics · Biomedical Text Mining and Ontologies · Genomics and Rare Diseases
