Student-Informed Teacher Training
Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza

TL;DR
This paper introduces a joint training framework for teacher and student policies in imitation learning, improving the student's ability to imitate complex behaviors from limited observations, especially under partial observability.
Contribution
It proposes a novel joint training method that encourages the teacher to learn behaviors more easily imitated by the student, addressing partial observability challenges.
Findings
Effective in maze navigation, quadrotor flight, and manipulation tasks.
Improves imitation accuracy under limited observation conditions.
Enhances robustness of learned behaviors in complex environments.
Abstract
Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limited observations, e.g., in a robot navigation task, the teacher might have access to distances to nearby obstacles, while the student only receives visual observations of the scene. However, privileged imitation learning faces a key challenge: the student might be unable to imitate the teacher's behavior due to partial observability. This problem arises because the teacher is trained without considering if the student is capable of imitating the learned behavior. To address this teacher-student asymmetry, we propose a framework for joint training of the teacher and student policies,…
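The abstract is cut off before it states the joint objective, but the reviews below describe a KL-divergence-based penalty on the teacher's reward. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes both policies output diagonal Gaussian action distributions. The function names, the direction of the KL term, and the `penalty_weight` value are all hypothetical.

```python
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between two diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total

def penalized_reward(task_reward, teacher_dist, student_dist, penalty_weight=0.1):
    """Task reward minus a weighted teacher-student divergence.

    teacher_dist and student_dist are (mean_list, std_list) tuples.
    Assumption: the penalty is KL(teacher || student); the source does not
    specify the direction or the weighting.
    """
    kl = gaussian_kl(*teacher_dist, *student_dist)
    return task_reward - penalty_weight * kl

# Hypothetical one-dimensional example: the student's mean action is off by 1.0.
teacher = ([0.0], [1.0])
student = ([1.0], [1.0])
shaped = penalized_reward(1.0, teacher, student, penalty_weight=0.1)
```

Under this sketch, the teacher earns less reward in states where the student's action distribution diverges from its own, which nudges the teacher toward behavior the student can reproduce from its limited observations.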
Peer Reviews
Decision: ICLR 2025 Spotlight
- Simple approach to handling asymmetry in teacher-student policy distillation, where the teacher is trained to minimize the divergence between teacher and student. This will constrain the teacher to states where the student can explore.
- Interesting experimental findings.
- The experiments show that teacher policies trained in this manner are better for teaching student policies.
- A teacher with this constraint gets higher returns than a teacher without, because learning to act with fewer inputs lead
## Experimental section has several problems
- Experimental section is lacking, sparse in task selection and choice of baselines, and seems a bit contrived. Several ways to fix this:
  - More standardized benchmarks, using envs from prior work like [1, 2]
  - Better baselines and analysis (see below)
  - Sim2real of the drone / manipulator results would round it out
- Choice of baselines is lacking; the paper really should compare with prior work on handling asymmetric RL / IL problems (see references bel
1. The paper presents a novel approach to tackling the teacher-student information asymmetry problem by incorporating the imitation learning performance bound into the teacher's objective function, leading to a creative combination of ideas from imitation learning theory and practical algorithm design.
2. The proposed method is well-motivated and grounded in the theoretical foundations of imitation learning, with clear explanations for the design of the KL-divergence-based penalty and supervisor
1. Limited theoretical analysis: The paper lacks a comprehensive theoretical analysis comparing the proposed method to existing approaches. A more rigorous theoretical justification for the superiority of the method would strengthen the contributions.
2. Insufficient ablation studies: The individual contributions of the reward penalty and the KL-divergence supervision are not clearly distinguished through ablation experiments, making it difficult to assess the necessity of each component.
3. Lim
The motivation for the problem and the proposed approach is presented intuitively (but lacks some clarity; see clarification questions Q1-2).
* The paper's positioning in the related work as the only one considering information asymmetry in the student-teacher framework is incorrect. There are several recent works that have looked at this problem [1-5]; in fact, an imitability cost like the one proposed is explored in [3].
* As a consequence of the above, I think the paper misses key baselines that actually tackle problems under similar assumptions of asymmetry (the considered BC and DAgger baselines are bound to fail here) and therefore
Taxonomy
Topics: Teacher Education and Leadership Studies
