From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation

Ju Dong; Liding Zhang; Lei Zhang; Yu Fu; Kaixin Bai; Zoltan-Csaba Marton; Zhenshan Bing; Zhaopeng Chen; Alois Christian Knoll; Jianwei Zhang

arXiv:2603.09415 · cs.RO · March 11, 2026

TL;DR

This paper introduces a fast, single-step policy distillation method for multi-modal robotic control, combining implicit maximum likelihood estimation with a bi-directional Chamfer distance to preserve diverse behaviors and enable real-time, high-frequency decision-making.

Contribution

The paper presents an IMLE-based distillation framework that pairs a set-level loss with a unified perception encoder, enabling real-time multi-modal control without collapsing the diversity of the teacher's behaviors.

Findings

01. Achieves real-time control with high-frequency re-planning.

02. Preserves the multi-modal distribution in a single forward pass.

03. Improves robustness under dynamic disturbances.

Abstract

Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bi-directional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, enabling preservation of the teacher multi-modal action distribution in a single forward pass. A unified perception…
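
To make the set-level objective in the abstract concrete, the following is a minimal sketch of a bi-directional Chamfer distance between a set of student trajectories and a set of teacher trajectories. It is not the authors' implementation: the squared Euclidean metric, the equal weighting of the two directions, the flattening of each trajectory into a single vector, and all names (`bidirectional_chamfer`, `teacher`, `collapsed`, `one_mode`, `diverse`) are assumptions made for illustration. The IMLE sampling and training loop are omitted.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (N, D) and b (M, D)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)  # shape (N, M)

def bidirectional_chamfer(student, teacher):
    """Return (total, fidelity, coverage) for two sets of flattened trajectories.

    fidelity: mean distance from each student sample to its nearest teacher sample.
    coverage: mean distance from each teacher sample to its nearest student sample.
    Equal weighting of the two terms is an assumption of this sketch.
    """
    d = pairwise_sq_dists(student, teacher)
    fidelity = d.min(axis=1).mean()
    coverage = d.min(axis=0).mean()
    return fidelity + coverage, fidelity, coverage

# Toy teacher set with two modes (hypothetical 2-D "trajectories").
teacher = np.array([[-1.0, -1.0], [1.0, 1.0]])

# Three hypothetical students: averaged, single-mode, and mode-covering.
collapsed = np.array([[0.0, 0.0], [0.0, 0.0]])
one_mode = np.array([[1.0, 1.0], [1.0, 1.0]])
diverse = np.array([[-1.0, -1.0], [1.0, 1.0]])

for name, student in [("collapsed", collapsed), ("one_mode", one_mode), ("diverse", diverse)]:
    total, fidelity, coverage = bidirectional_chamfer(student, teacher)
    print(f"{name}: total={total:.1f} fidelity={fidelity:.1f} coverage={coverage:.1f}")
```

In this toy case the single-mode student has zero fidelity error, since every sample it produces lies on a teacher trajectory, yet its coverage term is large because one teacher mode has no nearby student sample. The averaged student is penalized by both terms. Only the student that reproduces both modes drives the total to zero, which is the behavior a bi-directional, set-level loss is meant to reward.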

Taxonomy

Topics: Robot Manipulation and Learning · Human Motion and Animation · Motor Control and Adaptation