SEABO: A Simple Search-Based Method for Offline Imitation Learning
Jiafei Lyu, Xiaoteng Ma, Le Wan, Runze Liu, Xiu Li, Zongqing Lu

TL;DR
SEABO is a simple, search-based offline imitation learning method that assigns rewards based on proximity to expert demonstrations, enabling effective learning from unlabeled data with minimal expert input.
Contribution
Introduces SEABO, a novel unsupervised search-based offline imitation learning approach that performs well with limited expert data and no reward labels.
Findings
Achieves performance competitive with training on ground-truth rewards while using only a single expert trajectory.
Outperforms prior reward learning and offline IL methods on multiple datasets.
Works effectively even when expert demonstrations contain only observations.
Abstract
Offline reinforcement learning (RL) has attracted much attention due to its ability to learn from static offline datasets without interacting with the environment. Nevertheless, the success of offline RL relies heavily on offline transitions annotated with reward labels. In practice, we often need to hand-craft the reward function, which is sometimes difficult, labor-intensive, or inefficient. To tackle this challenge, we focus on the offline imitation learning (IL) setting and aim to obtain a reward function from the expert data and unlabeled data. To that end, we propose a simple yet effective search-based offline IL method, tagged SEABO. SEABO assigns a larger reward to a transition that is close to its nearest neighbor in the expert demonstration, and a smaller reward otherwise, all in an unsupervised learning manner. Experimental results…
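The reward assignment described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the exponential mapping from distance to reward, the parameters `alpha` and `beta`, the function names, and the toy data are all assumptions made for the example. A linear scan is used to keep the sketch dependency-free; the reviews below mention a KD-tree, which would replace `nearest_distance` for large expert sets.

```python
import math

def nearest_distance(query, expert_points):
    """Euclidean distance from `query` to its closest point in `expert_points`."""
    return min(math.dist(query, p) for p in expert_points)

def search_based_rewards(unlabeled, expert, alpha=1.0, beta=0.5):
    """Assign one reward per unlabeled transition.

    Each transition is a flat tuple of floats (for example, state, action and
    next state concatenated). The exponential form and the default alpha/beta
    are assumptions for illustration; the source only states that closer
    transitions receive larger rewards.
    """
    return [alpha * math.exp(-beta * nearest_distance(x, expert)) for x in unlabeled]

# Toy 2-D "transitions" (hypothetical data).
expert = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
unlabeled = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0)]

rewards = search_based_rewards(unlabeled, expert)
for x, r in zip(unlabeled, rewards):
    print(x, round(r, 4))
```

The first unlabeled point coincides with an expert point and receives the maximum reward `alpha`; rewards shrink as the nearest expert neighbor gets farther away. The labeled transitions could then be passed to any offline RL algorithm.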
Peer Reviews
Decision: ICLR 2024 poster
1. The proposed approach is both novel and simple, and its implementation is efficient due to the use of a KD-tree, without the need for training an extra discriminator.
2. This paper focuses on the context of single-expert-demonstration IL tasks, which is an area of growing interest in the field.
3. I find the discussion on using different search algorithms in Section 5.4 and Appendix Section C interesting.
4. The Limitations section in the appendix is highly appreciated, as it provides valuabl
1. I'm concerned about the use of Euclidean distance and would suggest that the authors include references justifying the use of this distance metric. This is crucial because there might be scenarios where states that are close in Euclidean distance are, in fact, far apart when accounting for the transitions within the Markov Decision Process (MDP). This particular challenge doesn't arise in discriminator-based methods, mainly due to the use of an additional neural network during training.
2. I
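The reviewer's concern about Euclidean distance can be made concrete with a toy example. The sketch below is purely illustrative and is not taken from the paper or the review: the grid layout, the function name, and the chosen cells are hypothetical. It compares the Euclidean distance between two cells with the number of moves actually needed to travel between them when a wall is in the way.

```python
import math
from collections import deque

# '#' is a wall, '.' is free space (hypothetical layout for illustration).
GRID = [
    "..#..",
    "..#..",
    "..#..",
    ".....",
]

def shortest_path_steps(grid, start, goal):
    """Number of 4-connected moves from start to goal (BFS), or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), steps = queue.popleft()
        if (r, c) == goal:
            return steps
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), steps + 1))
    return None

a, b = (0, 1), (0, 3)          # cells on opposite sides of the wall
euclidean = math.dist(a, b)    # straight-line distance, ignores the wall
steps = shortest_path_steps(GRID, a, b)
print(euclidean, steps)        # prints: 2.0 8
```

The two cells are only 2 units apart in Euclidean terms, yet reaching one from the other takes 8 moves. A nearest-neighbor reward based on Euclidean distance would treat them as close, which is the kind of mismatch the review describes.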
The idea is simple and well-explained in the paper. The empirical validation is thorough, with results across many D4RL locomotion tasks and a small number of manipulation tasks. A representative set of baselines is used and the method shows strong empirical performance. The simple proposed method beats more complex alternatives that need to use optimal transport etc.
I don't have many issues with the paper. The method has some limitations (see below), but I don't think this invalidates the contributions of the current paper. One minor weakness is that the approach is mostly evaluated on locomotion tasks, for which precision is not the most critical factor. It would be great to evaluate it on a challenging, long-horizon manipulation task to test the limits of the method. For example, the IKEA Furniture assembly benchmark could provide a nice test bed and I would be
- This paper proposes a very simple algorithm that achieves good performance relative to more complex methods. I think this type of work has good value to the community in that it introduces easy-to-reproduce results and discourages over-engineering of methods.
- The paper is clear in presentation and mostly well written.
- There is only empirical analysis of the proposed method. I believe there are certain tasks where SEABO would perform poorly, such as a cliff-walking type of task where there is a precise boundary between what is acceptable and what is a failure.
- I hypothesize that the approach will also only work in lower-dimensional control environments. This is because the method relies heavily on a distance function, and this could suffer from the curse of dimensionality in more complex environments. M
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Human Motion and Animation
Methods: Sparse Evolutionary Training · Focus
