Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning
Shangzhe Li, Zhiao Huang, Hao Su

TL;DR
This paper introduces a stable online imitation learning method whose reward model, based on random network distillation (RND), jointly estimates the expert and behavioral distributions in the world model's latent space; it outperforms adversarial approaches.
Contribution
It presents a novel RND-based reward model for world model IL that enhances stability and achieves expert-level performance across diverse benchmarks.
Findings
Demonstrates improved stability over adversarial IL methods.
Achieves expert-level performance in locomotion and manipulation tasks.
Validates effectiveness across DMControl, Meta-World, and ManiSkill2.
Abstract
Imitation Learning (IL) has achieved remarkable success across various domains, including robotics, autonomous driving, and healthcare, by enabling agents to learn complex behaviors from expert demonstrations. However, existing IL methods often face instability challenges, particularly when relying on adversarial reward or value formulations in world model frameworks. In this work, we propose a novel approach to online imitation learning that addresses these limitations through a reward model based on random network distillation (RND) for density estimation. Our reward model is built on the joint estimation of expert and behavioral distributions within the latent space of the world model. We evaluate our method across diverse benchmarks, including DMControl, Meta-World, and ManiSkill2, showcasing its ability to deliver stable performance and achieve expert-level results in both…
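The idea of using RND as a density proxy can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses a RED-style reward (the exponential of the negative prediction error) and, as an assumption, couples it with a bounded novelty bonus from a second predictor fitted on behavioral data. The names `Predictor`, `reward`, `sigma`, and `bonus_weight`, the Gaussian stand-ins for latent states, and the closed-form ridge fit (in place of gradient-based distillation) are all hypothetical choices made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, OUT_DIM, HIDDEN = 4, 16, 64

# Fixed, randomly initialised target network; it is never trained.
W_t = rng.normal(size=(LATENT_DIM, OUT_DIM))
b_t = rng.normal(size=OUT_DIM)


def target(z):
    return np.tanh(z @ W_t + b_t)


class Predictor:
    """Random-feature regressor distilled onto the target over one dataset.

    Its prediction error is small near the data it was fitted on and
    tends to grow away from it, which is what makes it a density proxy.
    """

    def __init__(self, z, ridge=1e-3):
        self.V = rng.normal(size=(LATENT_DIM, HIDDEN))
        self.c = rng.normal(size=HIDDEN)
        phi = self._features(z)
        # Assumption: closed-form ridge regression stands in for SGD training.
        gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
        self.readout = np.linalg.solve(gram, phi.T @ target(z))

    def _features(self, z):
        h = np.tanh(z @ self.V + self.c)
        return np.hstack([h, np.ones((len(z), 1))])

    def error(self, z):
        diff = self._features(z) @ self.readout - target(z)
        return np.mean(diff ** 2, axis=1)


# Hypothetical latent states: expert data near 0, behavioral data near 3.
expert_z = rng.normal(loc=0.0, scale=0.3, size=(512, LATENT_DIM))
policy_z = rng.normal(loc=3.0, scale=0.3, size=(512, LATENT_DIM))
expert_pred = Predictor(expert_z)
policy_pred = Predictor(policy_z)


def reward(z, sigma=10.0, bonus_weight=0.1):
    # High where the expert predictor is accurate (expert-like latents).
    expert_term = np.exp(-sigma * expert_pred.error(z))
    # Hypothetical coupling: bounded bonus where the policy has rarely been.
    novelty_term = 1.0 - np.exp(-sigma * policy_pred.error(z))
    return expert_term + bonus_weight * novelty_term


# Held-out samples from each distribution.
expert_test = rng.normal(loc=0.0, scale=0.3, size=(256, LATENT_DIM))
policy_test = rng.normal(loc=3.0, scale=0.3, size=(256, LATENT_DIM))
print("mean reward on expert-like latents:", reward(expert_test).mean())
print("mean reward on policy-like latents:", reward(policy_test).mean())
```

No discriminator is trained against the policy here, so there is no min-max game; each predictor is fitted by ordinary regression, which is the intuition behind the stability claim. How the paper actually combines the two estimates, including its bias correction, is not recoverable from this page.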
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Principled Formulation for Stable Imitation Learning: CDRED directly tackles the pervasive instability in adversarially trained reward or value formulations (e.g., GAIL, IQ-Learn) by replacing the adversarial component with a distributional density estimation (RND-based) framework, coupled for both expert and behavioral distributions in the latent space. Figure 13 and Figure 14, along with Appendix E.7 (Table 9), clearly quantify the inherent training instability and gradient norm blowups tha
1. Overreliance on Prior Work for Mathematical Innovation: A central technical ingredient—bias correction for distributional RND, estimation of occurrence frequencies, and consistency—is largely imported from Yang et al. (2024). The new contribution is in coupling expert and behavioral distribution estimation in the latent space, yet this coupling is principally an architectural/conceptual extension rather than a fundamentally new algorithm. More serious proof of theoretical improvements (e.g.,
1. **Stable reward formulation**: The coupled RND mechanism jointly estimates expert and behavioral latent distributions, balancing exploitation and exploration. This approach intuitively prevents collapse to sub-optimal expert matching while avoiding the instability of adversarial IL.
2. **Empirical robustness**: CDRED achieves expert-level performance on multiple domains and outperforms IQ-MPC and CFIL on Meta-World and ManiSkill2, while matching their performance on DMControl. Stability metri
1. **Clarity and self-containment**: Section 3 introduces the RND correction from Yang et al. (2024) before CDRED’s own contribution, which causes conceptual fragmentation. The derivation of Equations 7 and 8 is abrupt and not self-contained, leaving the reader to infer critical relationships between the bias correction term and the coupled reward. The paper is also hard to follow. The notation is heavy, and the description of the reward construction involves many symbols (e.g., $,\epsilon$, $b$
1. The paper identifies a real and important problem: instability in adversarial imitation learning within world models. The replacement of adversarial training with density-based reward modeling is reasonable.
2. The paper evaluates across multiple benchmarks and provides extensive ablations.
The paper’s overall motivation is not sufficiently clear. Although each component in the proposed framework (e.g., the coupled density estimation, RND-based reward model, and world model integration) appears to be useful on its own—as supported by the ablation studies—the connections among these components are weak and feel somewhat ad hoc. The method seems to be a combination of existing techniques rather than a unified, principled design. In particular, the entire paper is built upon the world
1. The paper identifies a well-known issue, training instability in adversarial imitation learning, and proposes a conceptually grounded alternative using density estimation via random network distillation.
2. Incorporating the CDRED reward model into a decoder-free world model (TD-MPC–style) is technically elegant and aligns with recent advances in model-based RL, improving sample efficiency and planning stability.
3. Empirical results across DMControl, Meta-World, and ManiSkill2 convincingly s
1. The proposed Coupled Distributional Random Expert Distillation (CDRED) aims to stabilize imitation learning by coupling expert and behavioral distributions in latent space. However, its innovation over prior methods such as RED (Wang et al., 2019) and CFIL (Freund et al., 2023) remains unclear. The paper should better distinguish CDRED’s theoretical advantages and explain why coupling yields more stable learning.
2. Sections 3.1–3.2 introduce correction terms and balancing factors (α, ζ), bu
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
