VFP: Variational Flow-Matching Policy for Multi-Modal Robot Manipulation
Xuanran Zhai, Qianyou Zhao, Qiaojun Yu, Ce Hao

TL;DR
VFP is a variational flow-matching policy that generates mode-aware actions, using optimal transport and a mixture-of-experts decoder to improve multi-modal robot manipulation in simulation and on real robots.
Contribution
The paper proposes VFP, a novel flow-matching policy that captures multi-modality using a variational latent prior, optimal transport, and a mixture-of-experts decoder, advancing multi-modal robot manipulation.
Findings
Achieves a 49% improvement in task success rate over baselines in simulation.
Outperforms standard flow-based policies on real-robot tasks.
Maintains fast inference and compact model size.
Abstract
Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 simulated tasks and 3 real-robot tasks, demonstrating its effectiveness and…
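The idea of mode-aware action generation can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a discrete categorical latent, one hand-written "expert" velocity field per mode, and fixed point targets standing in for learned action modes. All names (`expert_velocity`, `sample_action`, `targets`) are hypothetical, and the K-OT alignment term is omitted. The sketch draws a latent mode, selects the matching expert, and integrates the flow ODE from Gaussian noise with a few Euler steps.

```python
import numpy as np

def expert_velocity(x, t, target):
    # Hypothetical expert: velocity of the straight-line path that carries
    # the current point x to `target` at time t = 1.
    return (target - x) / (1.0 - t)

def sample_action(rng, mode_logits, expert_targets, num_steps=10):
    # 1) Draw a discrete latent mode z from a categorical prior (assumption:
    #    the real prior would be learned and conditioned on observations).
    probs = np.exp(mode_logits - mode_logits.max())
    probs /= probs.sum()
    z = rng.choice(len(probs), p=probs)

    # 2) Start from Gaussian noise and integrate dx/dt = v_z(x, t) with
    #    Euler steps, using only the expert selected by z.
    x = rng.standard_normal(expert_targets.shape[1])
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays below 1, so the division above is safe
        x = x + dt * expert_velocity(x, t, expert_targets[z])
    return z, x

rng = np.random.default_rng(0)
targets = np.array([[-1.0, 0.0], [1.0, 0.0]])  # two toy action modes
logits = np.zeros(2)                           # uniform prior over modes
samples = [sample_action(rng, logits, targets) for _ in range(200)]
```

In this toy setting every sample lands on the target of its own mode rather than on the average of the two targets, which is the behavior the latent variable is meant to encourage.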
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Human Pose and Action Recognition
