Learning What to Do and What Not To Do: Offline Imitation from Expert and Undesirable Demonstrations
Huy Hoang, Tien Mai, Pradeep Varakantham, Tanvi Verma

TL;DR
This paper introduces a new offline imitation learning method that leverages both expert and undesirable demonstrations by optimizing a difference of KL divergences, resulting in improved performance without adversarial training.
Contribution
The paper proposes a novel formulation that incorporates undesirable demonstrations into offline imitation learning, providing a convex objective under certain conditions and avoiding adversarial training.
Findings
Outperforms state-of-the-art baselines on standard benchmarks.
Handles both positive and negative demonstrations in a unified framework.
Provides a stable, non-adversarial training objective.
Abstract
Offline imitation learning typically learns from expert and unlabeled demonstrations, yet often overlooks the valuable signal in explicitly undesirable behaviors. In this work, we study offline imitation learning from contrasting behaviors, where the dataset contains both expert and undesirable demonstrations. We propose a novel formulation that optimizes a difference of KL divergences over the state-action visitation distributions of expert and undesirable (or bad) data. Although the resulting objective is a DC (Difference-of-Convex) program, we prove that it becomes convex when expert demonstrations outweigh undesirable demonstrations, enabling a practical and stable non-adversarial training objective. Our method avoids adversarial training and handles both positive and negative demonstrations in a unified framework. Extensive experiments on standard offline imitation learning…
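The convexity claim can be checked by hand in a simplified setting. The derivation below is a sketch, not necessarily the paper's own proof: it assumes a finite state-action space and that the expert distribution $d_G$ and undesirable distribution $d_B$ have full support. The symbols $d_G$, $d_B$, and the weight $\alpha$ follow the notation quoted in the reviews below.

$$
\begin{aligned}
f(d) &= D_{\mathrm{KL}}(d \parallel d_G) - \alpha\, D_{\mathrm{KL}}(d \parallel d_B) \\
&= \sum_{s,a} d(s,a)\log\frac{d(s,a)}{d_G(s,a)} \;-\; \alpha \sum_{s,a} d(s,a)\log\frac{d(s,a)}{d_B(s,a)} \\
&= (1-\alpha)\sum_{s,a} d(s,a)\log d(s,a) \;+\; \sum_{s,a} d(s,a)\,\bigl[\alpha \log d_B(s,a) - \log d_G(s,a)\bigr].
\end{aligned}
$$

The second sum is linear in $d$. The first sum is the negative entropy of $d$, which is convex, scaled by $(1-\alpha)$. Under these assumptions, $f$ is therefore convex whenever $\alpha \le 1$ (strictly convex for $\alpha < 1$, linear at $\alpha = 1$) and concave when $\alpha > 1$.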
Peer Reviews
Decision·Submitted to ICLR 2026
### **Strength 1. Practically useful problem setting**
The paper tackles a practically important and underexplored problem in offline imitation learning: using expert and explicitly undesirable demonstrations simultaneously. The authors provide a mathematically grounded formulation based on a difference of KL divergences, offering a clear and principled way to encode both imitation and avoidance objectives within a unified convex optimization framework.

### **Strength 2. Clear presentat
### **Weakness 1. Insufficient justification for adopting the IQ-Learn framework**
From the main objective (Eq. 2) to the IQ-Learn-based objective introduced around line 223, the concrete derivation steps are missing, even in the Appendix. Moreover, the motivation for explicitly adopting the IQ-Learn framework remains underdeveloped. In Appendix B.1, the authors discuss the limitation of applying Lagrangian duality to general $f$-divergences but do not explain why IQ-Learn is preferable to KL-specif
- The authors present a clear, principled goal: optimize $f(d_\pi) = D_{\mathrm{KL}}(d_\pi \parallel d_G) - \alpha D_{\mathrm{KL}}(d_\pi \parallel d_B)$ and show that it is convex in the occupancy measure when $\alpha \le 1$ (Proposition 4.1). This convexity enables a stable, non-adversarial optimization via duality (Section 4 and Appendix A). In short, framing the problem as a difference of KLs and proving convexity is clean and actionable.
- The authors recognize that the exponential terms in
- The convexity holds only for the KL divergence, as stated in the paper. Minimizing the KL divergence has its own advantages and disadvantages due to its characteristics. Convexity also holds only for $\alpha < 1$, and Appendix D.11 shows performance degradation when $\alpha \ge 1$. This is a limitation when one wishes to emphasize avoidance of bad demonstrations.
- A significant limitation of this work is the lack of comparison with other DICE-based approaches that could utilize the same contrastive reward sign
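The role of the $\alpha$ threshold raised in these reviews can be illustrated numerically. The following is a minimal sketch, assuming small random discrete distributions in place of real state-action visitation distributions. The helper names (`kl`, `objective`, `max_midpoint_gap`) are hypothetical and do not come from the paper. The sketch applies a midpoint convexity test to the difference-of-KL objective for several values of $\alpha$.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))


def objective(d, d_good, d_bad, alpha):
    """Difference-of-KL objective: KL(d || d_good) - alpha * KL(d || d_bad)."""
    return kl(d, d_good) - alpha * kl(d, d_bad)


def random_dist(rng, n):
    """Random strictly positive distribution over n outcomes (toy stand-in)."""
    p = rng.dirichlet(np.ones(n)) + 1e-12  # small offset guards against exact zeros
    return p / p.sum()


def max_midpoint_gap(alpha, n=6, trials=2000, seed=0):
    """Largest value of f(midpoint) - mean(f(endpoints)) over random pairs.

    For a convex f this quantity is never positive (up to rounding error).
    """
    rng = np.random.default_rng(seed)
    d_good, d_bad = random_dist(rng, n), random_dist(rng, n)
    worst = -np.inf
    for _ in range(trials):
        d1, d2 = random_dist(rng, n), random_dist(rng, n)
        mid = 0.5 * (d1 + d2)
        gap = objective(mid, d_good, d_bad, alpha) - 0.5 * (
            objective(d1, d_good, d_bad, alpha)
            + objective(d2, d_good, d_bad, alpha)
        )
        worst = max(worst, gap)
    return worst


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 1.5):
        print(f"alpha={alpha}: max midpoint gap = {max_midpoint_gap(alpha):.3e}")
```

In this toy setting the gap stays non-positive for $\alpha \le 1$ and becomes positive for $\alpha > 1$, so the midpoint test fails exactly where the reviews report that convexity is lost. This checks only the convexity property, not the learning performance reported in the paper.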
1. The “learn what to do and what not to do” framing is intuitively appealing and practically relevant to safety-critical applications.
2. The paper provides a new formulation that combines attraction toward expert data and repulsion from undesired data, and proves its convexity under the condition $\alpha<1$.
3. The paper avoids unstable training by deriving a dual Q-learning formulation.
4. The proposed approach achieves good empirical performance improvements across diverse tasks and datasets
1. The paper never clearly defines what qualifies as undesired. Are they failed trajectories, suboptimal but safe data, or catastrophic behaviors? The difference matters: discouraging mild inefficiency vs. avoiding dangerous actions are conceptually distinct. Without a formal definition or metric for “undesired,” it is unclear how $\alpha$ or the classifier boundaries correspond to actual safety or task performance. This ambiguity also limits the interpretability of results, since different task
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Multi-Agent Systems and Negotiation · Ethics and Social Impacts of AI
