Diffusion-Based Imaginative Coordination for Bimanual Manipulation
Huilin Xu, Jian Ding, Jiakun Xu, Ruixiang Wang, Jun Chen, Jinjie Mai, Yanwei Fu, Bernard Ghanem, Feng Xu, Mohamed Elhoseiny

TL;DR
This paper introduces a diffusion-based framework for bimanual manipulation in robotics, combining video and action prediction to improve coordination and success rates in complex tasks.
Contribution
It proposes a novel multi-frame latent prediction strategy and a unidirectional attention mechanism to enhance bimanual coordination efficiency.
Findings
Achieved 24.9% success rate increase on ALOHA benchmark.
Improved RoboTwin success rate by 11.1%.
Increased real-world task success by 32.5%.
Abstract
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, while action…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMotor Control and Adaptation
