What's the Move? Hybrid Imitation Learning via Salient Points
Priya Sundaresan, Hengyuan Hu, Quan Vuong, Jeannette Bohg, Dorsa Sadigh

TL;DR
SPHINX introduces a multimodal, salient point-based imitation learning approach that enhances robot manipulation by focusing on task-relevant features, enabling better generalization and efficiency across complex tasks.
Contribution
The paper presents SPHINX, a novel hybrid imitation learning method that combines multimodal observations and salient points for improved generalization and efficiency in robotic manipulation.
Findings
Achieves 86.7% success rate across multiple tasks.
Outperforms state-of-the-art IL baselines by 41.1%.
Generalizes to new viewpoints, distractors, and speeds.
Abstract
While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce SPHINX: Salient Point-based Hybrid ImitatioN and eXecution, a flexible IL policy that leverages multimodal observations (point clouds and wrist images), along with a hybrid action space of low-frequency, sparse waypoints and high-frequency, dense end effector movements. Given 3D point cloud observations, SPHINX learns to infer task-relevant points within a point cloud, or salient points, which support spatial generalization by focusing on semantically meaningful features. These salient points serve as anchor points to predict waypoints for long-range movement, such as reaching target poses in free-space.…
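The hybrid action space described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`select_salient_point`, `waypoint_from_salient`, `interpolate`, `plan_motion`), the argmax over per-point scores, the offset parameterization, and the straight-line interpolation are all assumptions made for illustration. In the paper the scores, offsets, mode, and dense actions would come from learned models; here they are plain inputs.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def select_salient_point(points: Sequence[Point], scores: Sequence[float]) -> Point:
    """Pick the point with the highest saliency score (assumed: a simple argmax)."""
    best = max(range(len(points)), key=lambda i: scores[i])
    return points[best]


def waypoint_from_salient(salient: Point, offset: Point) -> Point:
    """Anchor the waypoint at the salient point, shifted by a predicted offset (assumed)."""
    return tuple(s + o for s, o in zip(salient, offset))


def interpolate(start: Point, goal: Point, max_step: float) -> List[Point]:
    """Straight-line path from start to goal with steps no longer than max_step."""
    dist = math.dist(start, goal)
    n = max(1, math.ceil(dist / max_step))
    return [
        tuple(a + (b - a) * k / n for a, b in zip(start, goal))
        for k in range(1, n + 1)
    ]


def plan_motion(mode: str, ee_pos: Point, waypoint: Point,
                dense_delta: Point, max_step: float) -> List[Point]:
    """Dispatch on the (hypothetical) mode label.

    "waypoint": sparse, long-range move, expanded into an interpolated path.
    "dense":    a single small end-effector displacement.
    """
    if mode == "waypoint":
        return interpolate(ee_pos, waypoint, max_step)
    if mode == "dense":
        return [tuple(p + d for p, d in zip(ee_pos, dense_delta))]
    raise ValueError(f"unknown mode: {mode}")


# Toy usage: three cloud points, the second one scored as most salient.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
scores = [0.1, 0.7, 0.2]
salient = select_salient_point(cloud, scores)
waypoint = waypoint_from_salient(salient, (0.0, 0.0, 0.5))
path = plan_motion("waypoint", (0.0, 0.0, 0.0), waypoint, (0.0, 0.0, 0.0), 0.5)
```

In this toy setup the waypoint sits 0.5 above the most salient point, and the waypoint mode expands one sparse target into several short end-effector steps, while the dense mode emits one small displacement per call.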
Peer Reviews
Decision · ICLR 2025 Poster
1. This paper proposes a two-stage framework. The first stage leverages salient points for generating long-range movements, specifically as waypoints, followed by interpolation for motion planning and control. The second stage employs an image-based action similar to a diffusion policy, where actions are based on EE poses. This framework is reasonable for multi-stage manipulation tasks that require contact. 2. The paper is relatively complete, covering the proposed method, dataset collection, si
1. The novelty of the paper could be further clarified. Overall, applying a hierarchical approach to improve generalization and performance in IL is reasonable. This is novel within IL (only my point, if other reviewers can provide references, I will defer it). However, using a two-stage approach (i.e., long-range navigation + fine-grained manipulation) to handle long-term manipulation tasks is a common strategy, and I suggest the authors add a discussion on this. 2. There are some concerns rega
# Originality * While hierarchical approaches have been proposed, for instance in HACMan, NDF, TAX-Pose, etc., and explicit mode-switching between policies has been explored, I haven’t really seen this sort of mode switching in imitation learning for long-ish horizon manipulation. * Giving the policy a mechanism to change its inputs is a neat design consideration. * First time seeing diffusion policy in a hierarchical context. # Quality * The real-world experiments are well-designed, and demonstra
* The model’s hypothesis class (e.g. mode switching between long-distance free space motion, and short horizon fine-grained manipulation) imposes a major constraint on the kinds of problems it can represent effectively. Of course, in the extreme case, either one or the other mode can always be predicted, but in these settings no benefit / insight is offered, and hard attention switching across the inputs limits flexibility. * The assumptions about dataset preparation are a major weakness/limitat
1. **Enhanced Localization with Point Clouds**: The use of point clouds bolsters waypoint localization, offering a more resilient approach than RGB data alone. 2. **Improved Precision in Dense Manipulation**: Replacing the MLP with a diffusion policy for dense manipulation leads to greater accuracy during complex manipulation phases. 3. **Streamlined Mode Switching**: By introducing salient points, the authors provide a clear and novel mechanism for mode switching, simplifying task segmentation
1. **Reliance on Salient Points**: The approach depends on expert-determined salient points, which may limit scalability, especially if multiple annotators are involved, as this could introduce inconsistencies in training data. 2. **Lacks Details and Analysis in Experiments**: The baseline results are too low to tell whether the baselines were run correctly and compared fairly. It would be better to include some simple tasks on which the baselines perform reasonably, or to provide a failure analysis about h
Taxonomy
Topics: Advanced Vision and Imaging · Human Motion and Animation · Human Pose and Action Recognition
