MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation

Chengshu Li; Mengdi Xu; Arpit Bahety; Hang Yin; Yunfan Jiang; Huang Huang; Josiah Wong; Sujay Garlanka; Cem Gokmen; Ruohan Zhang; Weiyu Liu; Jiajun Wu; Roberto Martín-Martín; Li Fei-Fei

arXiv:2510.18316 · cs.RO · February 26, 2026

Open Access 3 Reviews

TL;DR

MoMaGen introduces a constrained optimization framework for automated data generation in multi-step bimanual mobile manipulation, enabling diverse datasets and effective imitation learning policies with minimal real-world data.

Contribution

This work presents a novel constrained optimization approach for generating diverse, high-quality datasets for complex bimanual mobile manipulation tasks, addressing key challenges of reachability and visibility.

Findings

01 MoMaGen produces more diverse datasets than previous methods.

02 Policies trained on MoMaGen data perform well in real-world deployment.

03 Minimal real-world data (40 demos) suffices for successful policy fine-tuning.

Abstract

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming. This challenge intensifies for multi-step bimanual mobile manipulation, where humans must teleoperate both the mobile base and two high-DoF arms. Prior X-Gen works have developed automated data generation frameworks for static (bimanual) manipulation tasks, augmenting a few human demos in simulation with novel scene configurations to synthesize large-scale datasets. However, prior works fall short for bimanual mobile manipulation tasks for two major reasons: 1) a mobile base introduces the problem of how to place the robot base to enable downstream manipulation (reachability) and 2) an active camera introduces the problem of how to position the camera to generate data for a visuomotor policy (visibility). To address…
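The split between hard and soft constraints can be illustrated with a toy base-pose sampler. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `BasePose`, `is_reachable`, `bearing_error`, and `sample_base_pose`, the planar annulus reach model, and all numeric ranges are hypothetical. Reachability is treated as a hard constraint (candidates that violate it are rejected), while a crude visibility proxy, how directly the base faces the target, is treated as a soft cost to minimize among the feasible candidates.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BasePose:
    x: float
    y: float
    yaw: float  # heading in radians


def is_reachable(pose, target, min_reach=0.3, max_reach=0.9):
    """Hard constraint (toy model): target lies in an annulus around the base."""
    dist = math.hypot(target[0] - pose.x, target[1] - pose.y)
    return min_reach <= dist <= max_reach


def bearing_error(pose, target):
    """Soft cost (toy model): absolute angle between heading and target direction."""
    bearing = math.atan2(target[1] - pose.y, target[0] - pose.x)
    diff = bearing - pose.yaw
    # Wrap the difference into [-pi, pi] before taking the magnitude.
    return abs(math.atan2(math.sin(diff), math.cos(diff)))


def sample_base_pose(target, n_samples=200, seed=0):
    """Rejection-sample poses; keep the feasible one with the lowest soft cost."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        pose = BasePose(
            x=target[0] + rng.uniform(-1.5, 1.5),
            y=target[1] + rng.uniform(-1.5, 1.5),
            yaw=rng.uniform(-math.pi, math.pi),
        )
        if not is_reachable(pose, target):
            continue  # hard constraint violated: discard the candidate
        if best is None or bearing_error(pose, target) < bearing_error(best, target):
            best = pose  # soft constraint: prefer better-aligned poses
    return best


if __name__ == "__main__":
    target = (2.0, 1.0)
    pose = sample_base_pose(target)
    print(pose, bearing_error(pose, target))
```

A real generator would replace the annulus check with inverse kinematics and collision checking, and the bearing proxy with an occlusion-aware camera model; the sketch only shows how rejection on hard constraints and ranking by soft cost fit together.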

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

**Originality** The paper makes several original contributions. The constrained optimization formulation elegantly unifies existing X-Gen methods while introducing novel visibility and reachability constraints specific to mobile manipulation. The distinction between hard constraints (must satisfy) and soft constraints (desirable) is intuitive and principled. Notably, the work is the first to tackle automated data generation for bimanual mobile manipulation, addressing visibility of moving camer

Weaknesses

**Limited Real-World Validation and Inadequate Scaling Analysis** The sim-to-real evaluation is insufficiently comprehensive. Only the simplest task (Pick Cup) is deployed on real hardware, achieving modest success rates (10% for WB-VIMA, 60% for π₀), which fails to validate whether the diversity benefits generalize to multi-step bimanual coordination tasks. Furthermore, while Figure 7 demonstrates data scaling trends in simulation, the choice of specific quantities (500, 1000, 2000 demonstrati

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Hard and soft visibility constraints are novel, and per experiments improve performance across tasks
- Method is validated on a range of free-space and contact-rich tasks, at varying levels of randomization
- Thorough experiments on the generated data and policy learning with MoMaGen data
- Method works well with just one human demonstration, reducing human supervision requirements
- Real-world transfer experiment (finetuned with 40 real demos), demonstrates benefits of pretraining on MoMaGen'

Weaknesses

- The current setup does not provide demonstrations for coordinated upper and lower-body control, a key developing area in mobile manipulation research
- Authors note that each successful demonstration takes ~0.1-1.3 GPU hours, which can substantially limit large-scale data generation. Such large-scale generation is important for especially complex, long-horizon mobile manipulation tasks

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. **Pioneering Problem Scope**: This research is groundbreaking as it is one of the first to focus on the full-body control problem, which integrates a mobile base, active vision, and dual-arm collaborative manipulation. This novel scope holds significant, pioneering implications for general-purpose robot control.
2. **Comprehensive Dataset Contribution**: The study constructs the most comprehensive MoMaGen dataset to date within the X-Gen series. This large-scale, diverse resource is invaluabl

Weaknesses

1. **Over-reliance on Heuristics for Automation**: Despite the formal definition of numerous hard and soft constraints, the actual demonstration generation process in simulation still heavily relies on manual intervention, such as simple inverse kinematics, heuristic rules, and human-provided one-shot annotations.
2. **Limited Task Generalization and Scalability**: The study currently only covers a small number of distinct task types (a total of four). This limited diversity is significantly le

Taxonomy

Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Social Robot Interaction and HRI