Learning What to Do and What Not To Do: Offline Imitation from Expert and Undesirable Demonstrations

Huy Hoang; Tien Mai; Pradeep Varakantham; Tanvi Verma

arXiv:2505.21182·cs.LG·May 28, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a new offline imitation learning method that leverages both expert and undesirable demonstrations by optimizing a difference of KL divergences, resulting in improved performance without adversarial training.

Contribution

The paper proposes a novel formulation that incorporates undesirable demonstrations into offline imitation learning, providing a convex objective under certain conditions and avoiding adversarial training.
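The "certain conditions" can be made concrete with a short expansion. This is a sketch rather than the paper's own proof. It borrows the notation used in the reviews further down ($d_\pi$ for the policy's state-action visitation distribution, $d_G$ and $d_B$ for the expert and undesirable data distributions, $\alpha$ for the weight on the undesirable term), and it assumes a finite state-action space with $d_G, d_B > 0$ wherever $d_\pi > 0$.

$$
\begin{aligned}
f(d_\pi) &= D_{\mathrm{KL}}(d_\pi \parallel d_G) - \alpha\, D_{\mathrm{KL}}(d_\pi \parallel d_B) \\
&= \sum_{s,a} d_\pi(s,a) \log \frac{d_\pi(s,a)}{d_G(s,a)} - \alpha \sum_{s,a} d_\pi(s,a) \log \frac{d_\pi(s,a)}{d_B(s,a)} \\
&= (1-\alpha) \sum_{s,a} d_\pi(s,a) \log d_\pi(s,a) - \sum_{s,a} d_\pi(s,a) \log d_G(s,a) + \alpha \sum_{s,a} d_\pi(s,a) \log d_B(s,a)
\end{aligned}
$$

The last two sums are linear in $d_\pi$, and $\sum_{s,a} d_\pi \log d_\pi$ is convex (it is the negative entropy). Under the stated assumptions, $f$ is therefore convex in $d_\pi$ when $\alpha \le 1$ and concave when $\alpha > 1$.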

Findings

01. Outperforms state-of-the-art baselines on standard benchmarks.

02. Handles both positive and negative demonstrations in a unified framework.

03. Provides a stable, non-adversarial training objective.

Abstract

Offline imitation learning typically learns from expert and unlabeled demonstrations, yet often overlooks the valuable signal in explicitly undesirable behaviors. In this work, we study offline imitation learning from contrasting behaviors, where the dataset contains both expert and undesirable demonstrations. We propose a novel formulation that optimizes a difference of KL divergences over the state-action visitation distributions of expert and undesirable (or bad) data. Although the resulting objective is a DC (Difference-of-Convex) program, we prove that it becomes convex when expert demonstrations outweigh undesirable demonstrations, enabling a practical and stable non-adversarial training objective. Our method avoids adversarial training and handles both positive and negative demonstrations in a unified framework. Extensive experiments on standard offline imitation learning…
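The convexity claim can be probed numerically on a toy problem. The sketch below is not the paper's training procedure: it evaluates the difference-of-KL objective on random discrete distributions that stand in for visitation distributions, and measures how far the objective violates the midpoint convexity inequality. The weight `alpha` on the undesirable term and all function names are assumptions introduced for illustration.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))


def diff_kl_objective(d_pi, d_good, d_bad, alpha):
    """KL(d_pi || d_good) - alpha * KL(d_pi || d_bad)."""
    return kl(d_pi, d_good) - alpha * kl(d_pi, d_bad)


def max_midpoint_violation(alpha, n=6, trials=2000, seed=0):
    """Largest observed f((p+q)/2) - (f(p)+f(q))/2 over random pairs.

    For a convex f this quantity is never positive (up to rounding).
    """
    rng = np.random.default_rng(seed)
    # Hypothetical stand-ins for the expert and undesirable distributions.
    d_good = rng.dirichlet(np.ones(n))
    d_bad = rng.dirichlet(np.ones(n))

    def f(d):
        return diff_kl_objective(d, d_good, d_bad, alpha)

    worst = -np.inf
    for _ in range(trials):
        p = rng.dirichlet(np.ones(n))
        q = rng.dirichlet(np.ones(n))
        gap = f(0.5 * (p + q)) - 0.5 * (f(p) + f(q))
        worst = max(worst, gap)
    return worst


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 1.5):
        print(alpha, max_midpoint_violation(alpha))
```

With `alpha` at or below 1 the reported gap should stay at or below zero up to floating-point error, while values above 1 should produce positive gaps. The check covers only the probability simplex and ignores the Bellman flow constraints that valid visitation distributions must also satisfy.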

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

### **Strength 1. Practically useful problem setting**

The paper tackles a practically important and underexplored problem in offline imitation learning, which uses expert and explicitly undesirable demonstrations simultaneously. The authors provide a mathematically grounded formulation based on a difference of KL divergences, offering a clear and principled way to encode both imitation and avoidance objectives within a unified convex optimization framework.

### **Strength 2. Clear presentat

Weaknesses

### **Weakness 1. Insufficient justification for adopting the IQ-Learn framework**

From the main objective (Eq. 2) to the IQ-Learn-based objective introduced around line 223, the concrete derivation steps are missing, even in the Appendix. Moreover, the motivation for explicitly adopting the IQ-Learn framework remains underdeveloped. In Appendix B.1, the authors discuss the limitation of applying Lagrangian duality to general $f$-divergences but do not explain why IQ-Learn is preferable to KL-specif

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The authors present a clear, principled goal: optimize $f(d_\pi) = D_{\mathrm{KL}}(d_\pi \parallel d_G) - \alpha D_{\mathrm{KL}}(d_\pi \parallel d_B)$ and show that it is convex in the occupancy measure when $\alpha \le 1$ (Proposition 4.1). This convexity enables a stable, non-adversarial optimization via duality (Section 4 and Appendix A). In short, framing the problem as a difference of KLs and proving convexity is clean and actionable.
- The authors recognize that the exponential terms in

Weaknesses

- The convexity only holds for the KL divergence, as stated in the paper. Minimizing the KL divergence has its own advantages and disadvantages due to its characteristics. Convexity also only holds for $\alpha < 1$, with Appendix D.11 showing performance degradation when $\alpha \ge 1$. This is a limitation when one wishes to emphasize avoidance of bad demonstrations.
- A significant limitation of this work is the lack of comparison with other DICE-based approaches that could utilize the same contrastive reward sign

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The “learn what to do and what not to do” framing is intuitively appealing and practically relevant to safety-critical applications.
2. The paper provides a new formulation that combines attraction toward expert data and repulsion from undesired data, and proves its convexity under the condition $\alpha<1$.
3. The paper avoids unstable training by deriving a dual Q-learning formulation.
4. The proposed approach achieves good empirical performance improvements across diverse tasks and datasets

Weaknesses

1. The paper never clearly defines what qualifies as undesired. Are they failed trajectories, suboptimal but safe data, or catastrophic behaviors? The difference matters: discouraging mild inefficiency vs. avoiding dangerous actions are conceptually distinct. Without a formal definition or metric for “undesired,” it is unclear how $\alpha$ or the classifier boundaries correspond to actual safety or task performance. This ambiguity also limits the interpretability of results, since different task

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Multi-Agent Systems and Negotiation · Ethics and Social Impacts of AI