PRISM: Performer RS-IMLE for Single-pass Multisensory Imitation Learning
Amisha Bhaskar, Pratap Tokekar, Stefano Di Cairano, Alexander Schperberg

TL;DR
PRISM is a fast, multisensory imitation learning policy that outperforms state-of-the-art generative models in real-world robotic tasks by combining a novel IMLE-based approach with a multisensory encoder.
Contribution
We introduce PRISM, a single-pass, multisensory imitation learning method using a batch-global rejection-sampling IMLE variant with a Performer architecture, enabling real-time control.
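The selection step behind a rejection-sampling IMLE objective can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a plain Euclidean distance between flattened action sequences, a single candidate pool shared by every target in the batch (one possible reading of "batch-global"), and a rejection radius `eps`. The names `seq_distance`, `rs_imle_select`, and `rs_imle_loss` are hypothetical.

```python
import math


def seq_distance(a, b):
    """Euclidean distance between two equal-length flattened action sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rs_imle_select(targets, candidates, eps):
    """For each target, pick the nearest candidate that is at least eps away.

    Candidates closer than eps are rejected (assumption: the target is
    already covered, so it should not keep pulling on those samples).
    Returns one (candidate_index, distance) pair per target; a target with
    no surviving candidate gets (None, 0.0).
    """
    picks = []
    for t in targets:
        best_idx, best_d = None, float("inf")
        for j, c in enumerate(candidates):
            d = seq_distance(t, c)
            if d < eps:
                continue  # rejected: already within the coverage radius
            if d < best_d:
                best_idx, best_d = j, d
        picks.append((best_idx, best_d if best_idx is not None else 0.0))
    return picks


def rs_imle_loss(targets, candidates, eps):
    """Mean distance from each target to its selected candidate."""
    active = [d for idx, d in rs_imle_select(targets, candidates, eps) if idx is not None]
    return sum(active) / len(active) if active else 0.0


# Two expert targets, three generated candidates (toy 2-D "sequences").
targets = [[0.0, 0.0], [1.0, 1.0]]
candidates = [[0.0, 0.05], [0.0, 1.0], [1.0, 2.0]]
picks = rs_imle_select(targets, candidates, eps=0.1)
loss = rs_imle_loss(targets, candidates, eps=0.1)
```

In a training loop, the candidates would presumably come from the generator in one batched forward pass, with gradients flowing only through the selected candidates. Setting `eps=0` recovers plain nearest-sample IMLE selection in this sketch.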
Findings
PRISM outperforms diffusion policies by 10-25% success rate on real-world tasks.
PRISM achieves 30-50 Hz control frequency, suitable for real-time applications.
PRISM improves success rates by ~25% on the CALVIN benchmark and significantly reduces trajectory jerk.
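Trajectory jerk, mentioned in the findings above, is the third time derivative of position. The summary does not say how the paper measures it, so the following is only one common way to estimate it: a third-order finite difference over uniformly sampled positions, averaged as a mean squared value. The function name `mean_squared_jerk` and the choice of metric are assumptions.

```python
def mean_squared_jerk(positions, dt):
    """Mean squared jerk of a 1-D trajectory sampled every dt seconds.

    Uses the third forward difference
        (x[t+3] - 3*x[t+2] + 3*x[t+1] - x[t]) / dt**3
    as a jerk estimate at each step.
    """
    if len(positions) < 4:
        raise ValueError("need at least 4 samples for a third difference")
    jerks = [
        (positions[t + 3] - 3 * positions[t + 2] + 3 * positions[t + 1] - positions[t]) / dt ** 3
        for t in range(len(positions) - 3)
    ]
    return sum(j * j for j in jerks) / len(jerks)


# Constant acceleration (x = t^2) has zero jerk; x = t^3 has constant jerk 6.
smooth = mean_squared_jerk([0.0, 1.0, 4.0, 9.0, 16.0], dt=1.0)
cubic = mean_squared_jerk([0.0, 1.0, 8.0, 27.0, 64.0], dt=1.0)
```

For a multi-joint trajectory, the same function could be applied per joint and the results averaged.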
Abstract
Robotic imitation learning typically requires models that capture multimodal action distributions while operating at real-time control rates and accommodating multiple sensing modalities. Although recent generative approaches such as diffusion models, flow matching, and Implicit Maximum Likelihood Estimation (IMLE) have achieved promising results, they often satisfy only a subset of these requirements. To address this, we introduce PRISM, a single-pass policy based on a batch-global rejection-sampling variant of IMLE. PRISM couples a temporal multisensory encoder (integrating RGB, depth, tactile, audio, and proprioception) with a linear-attention generator using a Performer architecture. We demonstrate the efficacy of PRISM on a diverse real-world hardware suite, including loco-manipulation using a Unitree Go2 with a 7-DoF arm D1 and tabletop manipulation with a UR5 manipulator. Across…
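The linear-attention idea behind a Performer generator can be sketched in a few lines. This is a simplified, pure-Python illustration under stated assumptions, not PRISM's generator: it uses positive random features to approximate the softmax kernel, draws plain Gaussian projections (no orthogonalization), and computes non-causal attention over the whole sequence. The names `positive_random_features` and `linear_attention`, and the defaults `num_features=64` and `seed=0`, are hypothetical.

```python
import math
import random


def positive_random_features(x, omegas):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); every entry is positive."""
    half_sq_norm = sum(v * v for v in x) / 2.0
    m = len(omegas)
    return [
        math.exp(sum(w * v for w, v in zip(om, x)) - half_sq_norm) / math.sqrt(m)
        for om in omegas
    ]


def linear_attention(queries, keys, values, num_features=64, seed=0):
    """Non-causal attention with cost linear in sequence length."""
    rng = random.Random(seed)
    d = len(queries[0])
    dv = len(values[0])
    scale = d ** -0.25  # scaling q and k this way gives q.k / sqrt(d) in the kernel
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_features)]
    phi_q = [positive_random_features([scale * v for v in q], omegas) for q in queries]
    phi_k = [positive_random_features([scale * v for v in k], omegas) for k in keys]

    # Summarize all keys and values once, instead of forming an N x N matrix.
    kv = [[0.0] * dv for _ in range(num_features)]
    k_sum = [0.0] * num_features
    for pk, v in zip(phi_k, values):
        for i, f in enumerate(pk):
            k_sum[i] += f
            for c in range(dv):
                kv[i][c] += f * v[c]

    out = []
    for pq in phi_q:
        denom = sum(f * s for f, s in zip(pq, k_sum))
        out.append(
            [sum(pq[i] * kv[i][c] for i in range(num_features)) / denom for c in range(dv)]
        )
    return out


q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = linear_attention(q, q, v)
```

Because the feature weights are positive and normalized, each output row is a convex combination of the value rows, as in softmax attention, while the key and value summaries are built once per sequence.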
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. The paper presents extensive simulation experiments across diverse tasks from **MetaWorld**, **RoboMimic**, and **CALVIN**, covering a wide range of control scenarios with varying numbers of available modalities.
2. The paper includes **real-robot loco-manipulation experiments**, which validate the effectiveness of the proposed **PRISM** method in real-world settings.
3. The paper is **well-written** and provides comprehensive details.
1. The proposed **PRISM** method is compared against different sets of baselines across benchmarks. This appears to result from directly adopting results from the original papers, which may not be ideal practice (please correct me if I am mistaken). The evaluation would be stronger if the authors included a consistent set of baselines, or at least made some major baselines available across all benchmarks. Additionally, the naming of baselines varies between benchmarks, which introd
- The central claim is that multimodal diversity can be achieved without iterative sampling, which is conceptually reasonable and empirically validated. The method approximates the data distribution by ensuring that every expert trajectory is covered by at least one generated sample. This is a principled alternative to adversarial or diffusion-based modeling that avoids training instability and heavy computation.
- The model’s bidirectional attention inherently enforces motion continuity and temporal con
- RS-IMLE matches samples implicitly but does not yield a tractable log-likelihood or uncertainty measure, which limits interpretability and makes it less suitable for planning or risk-sensitive control than probabilistic diffusion policies.
- The threshold scheduling is empirical. Performance can vary significantly with the choice of $\epsilon_{RS}$, and the paper lacks theoretical guidance for selecting it. Tasks are short or moderately long. It’s unclear whether the pipeline retains temporal consi
- The authors tested their method in simulation and in real-robot experiments.
- The method outperforms diffusion and flow policies in both simulation and real-world experiments by 10–25%.
- Ablation studies in the appendix help clarify how the different components of the method affect performance.
- One of the main motivations for this work is multimodal action distributions. Yet the authors did not show that their method is actually capable of learning these multimodal behaviours.
- The method has many hyperparameters, such as $K'$, $\epsilon$, and the Top-K weight.
- Demonstrations are limited to small-scale tasks (e.g., pick-and-place, insertion); scalability to larger problems is untested.
- The proposed loss combines many heuristics. There are cleaner probabilistic formulations with the same goal o
- The paper provides strong empirical validation through both extensive simulation benchmarks (MetaWorld, CALVIN, RoboMimic) and real-world deployment on a Unitree Go2 manipulation platform.
- RS-IMLE enables efficient parallel candidate generation and selection, allowing single-pass inference with low latency and avoiding the costly iterative denoising loops of diffusion and flow-matching methods.
- The Performer-based architecture supports real-time multisensory control, scaling effectively
- The flow in the methodology section is sometimes difficult to follow, particularly around Section 4.3, where the introduction of the robust sequence distance and RS-IMLE steps feels abrupt. Providing clearer preliminaries, unified notation, and a more gradual build-up to the rejection-sampling formulation (e.g., by first introducing standard IMLE objectives before the proposed batch-global extension) would help improve clarity and overall conceptual continuity.
- The paper lacks explicit repo
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Social Robot Interaction and HRI
