Beyond Student: An Asymmetric Network for Neural Network Inheritance

Yiyun Zhou; Jingwei Shi; Mingjing Xu; Zhonghua Jiang; Jingyuan Chen

arXiv:2602.09509·cs.LG·February 12, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces InherNet, a novel asymmetric network that inherits and reconstructs a teacher network's knowledge using low-rank decomposition, achieving better performance than traditional student networks in model compression.

Contribution

InherNet is the first network inheritance method that leverages asymmetric low-rank decomposition and SVD initialization to inherit and enhance teacher network knowledge.
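As a rough illustration of what SVD initialization of a low-rank factorization generally looks like, the block below writes out the standard truncated-SVD construction. The notation ($W$ for a teacher weight matrix, $r$ for the retained rank, $A$ and $B$ for the two factors) and the symmetric split of $\Sigma_r$ are assumptions made here for illustration; this summary does not state how InherNet splits the singular values or how it makes the factors asymmetric.

```latex
% Assumed notation, not taken from the paper:
% W is one teacher weight matrix, r is the retained rank.
\begin{align}
  W &= U \Sigma V^\top, \qquad W \in \mathbb{R}^{m \times n},\quad
       \sigma_1 \ge \sigma_2 \ge \dots \ge 0 \\
  W_r &= U_r \Sigma_r V_r^\top = A B, \qquad
       A = U_r \Sigma_r^{1/2} \in \mathbb{R}^{m \times r},\quad
       B = \Sigma_r^{1/2} V_r^\top \in \mathbb{R}^{r \times n} \\
  \lVert W - W_r \rVert_F^2 &= \sum_{i > r} \sigma_i^2
\end{align}
```

The last line is the Eckart–Young result: among all rank-$r$ matrices, $W_r$ has the smallest Frobenius error, and that error is determined entirely by the discarded singular values. This is why a fast-decaying singular value spectrum makes a layer a good candidate for this kind of factorization.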

Findings

1. InherNet outperforms student networks of similar size across multiple tasks.

2. The method effectively balances depth, width, and compression efficiency.

3. Experimental results validate the superiority of InherNet over traditional distillation approaches.

Abstract

Knowledge Distillation (KD) has emerged as a powerful technique for model compression, enabling lightweight student networks to benefit from the performance of redundant teacher networks. However, the inherent capacity gap often limits the performance of student networks. Inspired by the expressiveness of pretrained teacher networks, a compelling research question arises: is there a type of network that can not only inherit the teacher's structure but also maximize the inheritance of its knowledge? Furthermore, how does the performance of such an inheriting network compare to that of student networks, all benefiting from the same teacher network? To further explore this question, we propose InherNet, a neural network inheritance method that performs asymmetric low-rank decomposition on the teacher's weights and reconstructs a lightweight yet expressive network without significant…
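To make the "lightweight" claim concrete, the following minimal sketch counts parameters for a single dense layer before and after a generic rank-r factorization. It is an assumption-laden illustration, not the paper's accounting: the function names are hypothetical, biases are ignored, and any extra structure InherNet adds on top of the two factors is not modeled.

```python
def dense_params(m: int, n: int) -> int:
    """Parameters in a dense m x n weight matrix (bias ignored)."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Parameters in a rank-r factorization A (m x r) times B (r x n)."""
    return r * (m + n)


def compression_ratio(m: int, n: int, r: int) -> float:
    """How many times smaller the factorized layer is than the dense one."""
    return dense_params(m, n) / low_rank_params(m, n, r)


def max_saving_rank(m: int, n: int) -> int:
    """Largest rank r for which r * (m + n) is strictly below m * n."""
    return (m * n - 1) // (m + n)


if __name__ == "__main__":
    # Hypothetical layer shape chosen for illustration only.
    m, n, r = 768, 768, 64
    print(dense_params(m, n), low_rank_params(m, n, r))
    print(compression_ratio(m, n, r), max_saving_rank(m, n))
```

For a square 768 x 768 layer, the factorization only saves parameters when the rank stays below half the layer width, which is the basic trade-off between retained rank and compression that any low-rank method has to manage.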

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. The authors also provide a rigorous mathematical analysis of the InherNet architecture, including (1) convergence guarantees under standard assumptions, (2) proofs of parameter efficiency and knowledge preservation based on the singular value spectrum, and (3) formal definitions of Parameter Efficiency, Expressivity-to-Parameter Ratio, and Approximation Error. 2. The design of a fixed asymmetric expert-head structure combined with SVD-based decomposition is novel. The paper provides a clear architectur

Weaknesses

1. I noticed that a couple of the baselines, like MLKD and Logit Std., were trained for more epochs than other methods. That makes it tricky to know if InherNet is really better, or if the training schedule just favors it. For a fair comparison, I think all methods should be evaluated under the same settings. Otherwise, the results lose a bit of their strength. 2. In the analysis section, the authors mention that distillation can hurt performance for large InherNet models, which is super int

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The proposed SVD-Driven NNI algorithm directly compresses the teacher models instead of training a separate student model during the knowledge distillation process. Therefore, the proposed algorithm can build more complex but lightweight network architectures without increasing the number of model parameters or the inference time. 2. The proposed algorithm is supported by a comprehensive theoretical analysis, which helps readers understand its advantages. 3. The paper dem

Weaknesses

1. The proposed algorithm performs layer-wise compression during the knowledge distillation process, but it does not account for the correlation between layers when doing so. As a result, the compressed network might be suboptimal.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper systematically introduces the "Network Inheritance (NI)" paradigm for the first time, breaking away from the conventional knowledge distillation (KD) approach that transfers knowledge solely through soft labels. Comprehensive evaluations on multiple datasets, including CIFAR-100, GLUE, and CC3M, demonstrate the generality and robustness of the proposed method.

Weaknesses

1. The related work “Matrix Compression via Randomized Low Rank and Low Precision Factorization” also employs LoRA for model compression. The authors need to clarify the distinction between their approach and that work, beyond merely stating that they come from different research domains. 2. The authors should provide an analysis of model performance under varying compression ratios, showing how performance changes as the compression rate increases. 3. The role of the MoE (Mixture of Experts)

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning