Reverse Distillation: Consistently Scaling Protein Language Model Representations

Darius Catrina; Christian Bepler; Samuel Sledzieski; Rohit Singh

arXiv:2603.07710·cs.LG·March 10, 2026

TL;DR

This paper introduces Reverse Distillation, a novel framework for improving protein language models by decomposing large model representations into shared and unique features, leading to consistent performance gains across various tasks.

Contribution

The paper proposes Reverse Distillation, a new method that enhances large protein language models by orthogonally decomposing their representations based on smaller models, improving performance.

Findings

01

Reverse distillation improves model performance at the same embedding size.

02

The 15 billion parameter reverse-distilled model achieves top results on ProteinGym.

03

The framework is applicable to other model families with scaling issues.

Abstract

Unlike the predictable scaling laws in natural language processing and computer vision, protein language models (PLMs) scale poorly: for many tasks, models within the same family plateau or even decrease in performance, with mid-sized models often outperforming the largest in the family. We introduce Reverse Distillation, a principled framework that decomposes large PLM representations into orthogonal subspaces guided by smaller models of the same family. The resulting embeddings have a nested, Matryoshka-style structure: the first k dimensions of a larger model's embedding are exactly the representation from the smaller model. This ensures that larger reverse-distilled models consistently outperform smaller ones. A motivating intuition is that smaller models, constrained by capacity, preferentially encode broadly-shared protein features. Reverse distillation isolates these shared…
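The nested structure described in the abstract can be illustrated with a small sketch. It is not taken from the paper: the abstract does not say how the orthogonal subspaces are computed, so this example assumes a linear least-squares fit from the small model's embedding to the large model's embedding, followed by PCA on the residual. The function name `reverse_distill_sketch` and its parameters are hypothetical.

```python
import numpy as np


def reverse_distill_sketch(h_small, h_large, out_dim):
    """Illustrative only: build a nested embedding of width out_dim.

    h_small: (n, k) embeddings from the smaller model.
    h_large: (n, d) embeddings from the larger model, same n proteins.
    The first k output columns are h_small, copied unchanged.
    """
    n, k = h_small.shape
    if out_dim <= k:
        raise ValueError("out_dim must exceed the small embedding width")

    # Centre both embeddings so the regression needs no intercept term.
    xs = h_small - h_small.mean(axis=0)
    xl = h_large - h_large.mean(axis=0)

    # Assumption: the 'shared' part of the large embedding is whatever a
    # linear map from the small embedding can predict (least squares).
    coef, *_ = np.linalg.lstsq(xs, xl, rcond=None)
    residual = xl - xs @ coef  # part the small model cannot explain

    # Assumption: compress the residual with PCA (via SVD) and keep the
    # top (out_dim - k) directions as the 'unique' features.
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    unique = residual @ vt[: out_dim - k].T

    # Matryoshka-style layout: small-model features first, then the rest.
    return np.concatenate([h_small, unique], axis=1)
```

Under these assumptions, truncating the output to its first k columns recovers the smaller model's embedding exactly, and the appended columns are uncorrelated with the centred small-model features across the sample, because least-squares residuals are orthogonal to the regressors.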

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper establishes the central problem: PLMs "scale relatively poorly", with the ESM-2 family's performance plateauing. The authors' core hypothesis is highly intuitive: this is due to larger models "entangling" universal (low-level) and specialized (high-level) features, which increases variance.
2. This work introduces high-performing and efficient embeddings. The resulting models outperform baselines at the same size. They also feature a "Matryoshka-style" structure, which allows

Weaknesses

1. The authors should at least attempt a reverse distillation of 3B $\rightarrow$ 15B (or the full chain up to 15B). This experiment is critical. If rd.15B outperforms rd.3B, the paper's core thesis is validated. If rd.15B still underperforms, it would suggest the scaling problem is more complex than just feature entanglement, fundamentally weakening the paper's conclusion.
2. This linear-only approach may be restrictive. The paper itself hypothesizes that larger models encode rarer, higher-orde

Reviewer 02 · Rating 8 · Confidence 3

Strengths

This paper is quite strong in my opinion. The proposed methods are a clear improvement on current approaches and are a valuable contribution to the protein representation field.

Weaknesses

* Additional inference time, though not prohibitive, could still limit adoption.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The paper tackles a critical, well-documented problem (PLM scaling failure) with a highly novel solution. The idea of using smaller models as a basis for post-hoc orthogonal decomposition is elegant and new.
2. The experiments persuasively demonstrate that RD works. It not only improves baseline performance (e.g., rd.3B > 3B) but, more importantly, it restores monotonic scaling (rd.3B > rd.650M wins 96.4% of the time, vs. 53.6% for the baseline).
3. The BioMap experiment (Table 4) provides s

Weaknesses

The paper's primary weakness is the exclusion of the ESM-2 15B model. The most severe example of scaling failure is the performance degradation from 3B to 15B. The paper only demonstrates fixing the 650M-to-3B plateau. Without testing the 15B model, the central claim of "solving" the scaling paradox is incomplete. The core idea of using orthogonal subspaces to separate/disentangle knowledge, while novel in this application, is conceptually similar to methods in continual learning (e.g., O-LoRA

Code & Models

Models

🤗
singhlab/plm_reverse_distillation
model· 13 dl

Taxonomy

Topics: Machine Learning in Bioinformatics · Biomedical Text Mining and Ontologies · Genomics and Rare Diseases