Beyond Neural Incompatibility: Cross-Scale Knowledge Transfer in Language Models through Latent Semantic Alignment

Jian Gu; Aldeida Aleti; Chunyang Chen; Hongyu Zhang

arXiv:2510.24208·cs.CL·May 19, 2026

TL;DR

This paper introduces SemAlign, a method for fine-grained, cross-scale knowledge transfer in language models using latent semantic alignment via activations, improving transfer effectiveness across different architectures.

Contribution

The paper proposes a novel semantic alignment approach that transfers knowledge through activations rather than parameters, enabling effective cross-scale transfer between diverse language models.

Findings

01

SemAlign improves transfer performance on four benchmarks.

02

Semantic decomposition and recomposition stabilize cross-scale transfer.

03

Layer-wise semantic supervision enhances transfer quality.

Abstract

Language Models (LMs) encode substantial knowledge in their parameters, yet it remains unclear how to transfer such knowledge in a fine-grained manner, a problem known as parametric knowledge transfer (PKT). A central challenge is to make cross-scale transfer effective and efficient when source and target models differ in architecture and parameterization, so that direct parameter reuse is strongly limited by neural incompatibility. In this paper, we identify latent semantic alignment as the key prerequisite for cross-scale knowledge transfer. Instead of directly moving layer parameters, our approach uses activations as the transfer medium. SemAlign has two stages: a layer attribution stage that attributes task-relevant source layers and selects exactly one source layer for each target layer, and a semantic alignment stage that pairs them layer by layer and optimizes the target…
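
The layer-pairing idea can be illustrated with a minimal sketch. The abstract does not specify the attribution criterion, so the code below swaps in linear centered kernel alignment (CKA) as a stand-in similarity score. The names `linear_cka` and `pair_layers` are hypothetical, and this is not the authors' implementation. It only shows that activations from layers of different widths can be compared directly, and that each target layer can be assigned exactly one source layer.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    x has shape (n_samples, width_x) and y has shape (n_samples, width_y).
    The widths may differ; only the number of samples must match.
    Assumes neither matrix is constant across samples (non-zero denominator).
    """
    # Center each feature over the sample axis.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def pair_layers(source_acts, target_acts):
    """Assign exactly one source layer index to each target layer.

    Assumption for illustration: the "best" source layer is the one whose
    activations have the highest linear CKA with the target layer's
    activations on the same inputs.
    """
    pairs = []
    for t in target_acts:
        scores = [linear_cka(s, t) for s in source_acts]
        pairs.append(int(np.argmax(scores)))
    return pairs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy activations: 3 source layers of width 16, 2 target layers of width 8.
    source = [rng.normal(size=(64, 16)) for _ in range(3)]
    target = [rng.normal(size=(64, 8)) for _ in range(2)]
    print(pair_layers(source, target))
```

Linear CKA depends only on the sample-by-sample Gram matrices, so the source and target widths need not match. This is one reason activations are a convenient comparison medium when parameter shapes are incompatible.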

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 2

Strengths

The approach is simpler than previous work, focusing on layer outputs rather than weights or logits.

Weaknesses

- The paper has a narrow literature focus, comparing against only two prior works.
- The extracted shared basis between the teacher and student models is overcomplete, so the "resultant semantic" equation is not correct.
- The models have to share the same vocabulary, limiting the applicability of this method.

Reviewer 02·Rating 4·Confidence 3

Strengths

1. The paper proposes a new semantics-first perspective on parametric knowledge transfer (PKT), which targets the "neural incompatibility" bottleneck by using latent activations instead of raw parameters.
2. The proposed SemAlign method achieves empirical gains over existing PKT baselines on certain benchmarks, such as MMLU and HumanEval. The paper also reports advantages when transferring from specialized, code-focused teacher models.
3. The paper includes an analysis using CKA (Figure 4) to

Weaknesses

1. **Weak Empirical Validation for the Core "Semantic Basis" Mechanism.** The paper's central claim concerns the superiority of transferring knowledge via "semantic alignment" rather than direct parameter manipulation. However, the empirical evidence provided specifically for this mechanism is thin. The concept of "Vocabulary-Defined Semantics" is adopted from prior work, and its validation within this paper is limited to a single experiment in Figure 2. This experiment, which validates the "resolu

Reviewer 03·Rating 2·Confidence 3

Strengths

- The problem of architecture-agnostic knowledge distillation is relevant and interesting.
- The presented performance results are promising and warrant further research in semantics-aware activation distillation.

Weaknesses

- The paper lacks innovation and novelty. The main contribution of the paper centers on layer-wise semantics-aware distillation in settings where the student differs architecturally from the teacher. However, the semantic decomposition was introduced by Gu et al. (2024). Moreover, the architectural differences are simply resolved by mapping all activations from layers exceeding the depth of the student into the last layer. This choice lacks experimental support and theoretical grounding.
- Performance benefi
