Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction

Chenyou Fan; Fangzheng Yan; Chenjia Bai; Jiepeng Wang; Chi Zhang; Zhen Wang; Xuelong Li

arXiv:2505.24156 · cs.CV · June 2, 2025

Open Access

TL;DR

This paper introduces a novel bimanual manipulation policy that leverages flow-based video prediction and fine-tuning of text-to-video models, enabling better generalization and reducing data requirements for dual-arm robots.

Contribution

It proposes a two-stage flow-based video prediction framework with fine-tuned text-to-flow and flow-to-video models for improved bimanual manipulation.

Findings

1. Effective in simulation and real-world dual-arm robot experiments.
2. Reduces data requirements compared to existing methods.
3. Enhances generalization of bimanual manipulation policies.

Abstract

Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a…

Taxonomy

Topics: Hydrology and Watershed Management Studies · Model Reduction and Neural Networks

Methods: Diffusion