Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction
Chenyou Fan, Fangzheng Yan, Chenjia Bai, Jiepeng Wang, Chi Zhang, Zhen Wang, Xuelong Li

TL;DR
This paper introduces a novel bimanual manipulation policy that leverages flow-based video prediction and fine-tuning of text-to-video models, enabling better generalization and reducing data requirements for dual-arm robots.
Contribution
It proposes a two-stage flow-based video prediction framework with fine-tuned text-to-flow and flow-to-video models for improved bimanual manipulation.
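The data flow of the two-stage framework can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the class name TwoStagePipeline, the three toy_* functions, and the stand-in types (a list of integers for a frame, an integer shift for a flow field, a pair of floats for left-arm and right-arm commands) are hypothetical. Only the ordering of the stages (text to flow, flow to video, video to actions) comes from the summary above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-ins; real systems would use image tensors and dense flow.
Frame = List[int]             # placeholder for an image
Flow = int                    # placeholder for a flow field (a cyclic shift)
Action = Tuple[float, float]  # placeholder for (left-arm, right-arm) commands


@dataclass
class TwoStagePipeline:
    """Chains text-to-flow, flow-to-video, and an action policy."""
    text_to_flow: Callable[[str, Frame], List[Flow]]
    flow_to_video: Callable[[Frame, List[Flow]], List[Frame]]
    policy: Callable[[List[Frame]], List[Action]]

    def plan(self, instruction: str, first_frame: Frame) -> List[Action]:
        # Stage 1: predict motion (flow) from the instruction and first frame.
        flows = self.text_to_flow(instruction, first_frame)
        # Stage 2: render future frames conditioned on the predicted flow.
        frames = self.flow_to_video(first_frame, flows)
        # Action generation: map predicted frames to dual-arm commands.
        return self.policy(frames)


def toy_text_to_flow(instruction: str, frame: Frame, horizon: int = 3) -> List[Flow]:
    # Toy model: ignore the inputs and predict a shift of 1 at every step.
    return [1] * horizon


def toy_flow_to_video(frame: Frame, flows: List[Flow]) -> List[Frame]:
    # Toy model: apply each shift cyclically to the previous frame.
    frames, current = [], list(frame)
    for shift in flows:
        current = current[-shift:] + current[:-shift]
        frames.append(current)
    return frames


def toy_policy(frames: List[Frame]) -> List[Action]:
    # Toy model: read one number per arm from the two ends of each frame.
    return [(float(f[0]), float(f[-1])) for f in frames]


pipeline = TwoStagePipeline(toy_text_to_flow, toy_flow_to_video, toy_policy)
actions = pipeline.plan("lift the box with both arms", [1, 0, 0, 0])
```

Keeping the three stages behind separate callables mirrors the description of independent text-to-flow and flow-to-video models, so any one stage could be swapped without touching the others.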
Findings
The method is effective in both simulation and real-world dual-arm robot experiments.
It reduces data requirements compared to existing methods.
It improves the generalization of bimanual manipulation policies.
Abstract
Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a…
Taxonomy
Topics: Hydrology and Watershed Management Studies · Model Reduction and Neural Networks
Methods: Diffusion
