Representation Convergence: Mutual Distillation is Secretly a Form of Regularization
Zhengpeng Xie, Jiahang Cao, Changwei Wang, Fan Yang, Marco Hutter, Qiang Zhang, Jianxiong Zhang, Renjing Xu

TL;DR
This paper reveals that mutual distillation in reinforcement learning acts as a regularizer, improving policy robustness and generalization by fostering invariant representations, supported by theoretical proofs and empirical evidence.
Contribution
It provides the first theoretical proof linking policy robustness to generalization and empirically shows mutual distillation promotes invariant representations.
Findings
Mutual distillation enhances policy robustness.
Invariant representations emerge spontaneously.
Improved generalization performance is observed.
Abstract
In this paper, we argue that mutual distillation between reinforcement learning policies serves as an implicit regularization, preventing them from overfitting to irrelevant features. We highlight two separate contributions: (i) Theoretically, for the first time, we prove that enhancing the policy robustness to irrelevant features leads to improved generalization performance. (ii) Empirically, we demonstrate that mutual distillation between policies contributes to such robustness, enabling the spontaneous emergence of invariant representations over pixel inputs. Ultimately, we do not claim to achieve state-of-the-art performance but rather focus on uncovering the underlying principles of generalization and deepening our understanding of its mechanisms.
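The regularization view in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation: the function names, the KL direction, and the weight `alpha` are assumptions made here for illustration, following the common deep-mutual-learning pattern in which each policy adds a KL penalty toward its peer's action distribution on the same observations.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the action dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) per batch element for discrete action distributions.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def mutual_distillation_penalties(logits_a, logits_b):
    # Hypothetical helper: each policy is penalized for disagreeing with its peer.
    # The KL direction (peer as target) is an assumption, not taken from the paper.
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    penalty_a = kl_divergence(p_b, p_a).mean()  # pulls policy A toward policy B
    penalty_b = kl_divergence(p_a, p_b).mean()  # pulls policy B toward policy A
    return penalty_a, penalty_b

def regularized_loss(rl_loss, penalty, alpha=1.0):
    # Total objective for one policy: its own RL loss plus the weighted
    # distillation penalty, which plays the role of the implicit regularizer.
    return rl_loss + alpha * penalty
```

In this sketch the penalty is zero when both policies produce the same action distribution and grows as they diverge, so features that only one policy relies on are discouraged.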
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
The paper presents a new theoretical framework to investigate generalisation issues in deep RL. Generalisation in RL is a major and actively researched topic. The insights provided by the paper will have far-reaching impact.
Although the work is impactful, the experimental evaluation is limited, in the sense that, apart from the toy example, it does not demonstrate that the phenomenon exists beyond testing on the ProcGen benchmark and reporting performance. There are also other methods focusing on distillation (mutual or peer), but these papers do not seem to be mentioned in the paper. It would be good to see a comparison, for example, * Periodic Intra-Ensemble Knowledge Distillation for Reinforcement Learning, https://arxiv
### Strengths:
1. This work presents a formal proof of a long-standing assumption that robustness of a policy against irrelevant features improves generalization. In particular, the authors derive a lower bound on generalization performance that includes minimization of a robustness term, defined as how much a policy is influenced by two different rendering (perturbation) functions.
2. The paper is presented in a clear and well-organized way. In particular, Figs. 1 and 2 help the readers to better under
### Weaknesses:
1. The proposed method has been validated only on the ProcGen benchmark. Experiments on more diverse setups are needed to show the applicability of such methods.
2. While I understand that the target is not to outperform the state of the art, how DML stands against other data-augmentation-based approaches such as [1] is not evident. While the authors present results with SPO, SPO's performance itself does not seem up to the current standard.
3. The proposed method relies
- The central idea of using mutual distillation to induce robustness to spurious correlations in the training data is interesting.
- The empirical results for MDPO look promising.
- The analytical experiments of the robustness of the MDPO policy to visual disturbances and the quality of the learned representations in Sections 5.3 and 5.4 are very insightful.
- The positioning of this work is severely lacking, especially concerning previous literature.
- The paper seems to miss several related works on topics such as:
  - Representation learning in RL (for example, [1,2])
  - Policy distillation for generalization (for example, [3,4])
  - Mutual distillation (for example, [5])
  - Overfitting to training data in RL (for example, [6,7,8])
  - The above are not necessarily exhaustive.
- The claims regarding theoretical novelty are strong, but dif
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
