RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen; Zanxin Chen; Baijun Chen; Zijian Cai; Yibin Liu; Zixuan Li; Qiwei Liang; Xianliang Lin; Yiheng Ge; Zhenyu Gu; Weiliang Deng; Yubin Guo; Tian Nian; Xuanbing Xie; Qiangyu Chen; Kailun Su; Tianling Xu; Guodong Liu; Mengkang Hu; Huan-ang Gao; Kaixuan Wang; Zhixuan Liang; Yusen Qin; Xiaokang Yang; Ping Luo; Yao Mu

arXiv:2506.18088·cs.RO·August 28, 2025

TL;DR

RoboTwin 2.0 introduces a scalable data generation framework with domain randomization for robust bimanual robotic manipulation, significantly improving sim-to-real transfer and policy robustness across diverse tasks and environments.

Contribution

The paper presents RoboTwin 2.0, a novel scalable framework for generating diverse, realistic data and benchmarks for dual-arm manipulation with strong domain randomization techniques.

Findings

1. 10.9% increase in code generation success rate
2. 367% improvement in policy performance with minimal real data
3. Effective robustness to environmental variations
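
The 367% figure reads most naturally as a relative improvement, that is, the gain divided by the baseline. The sketch below shows only that arithmetic. The two success rates are hypothetical values picked so that the result lands on 367%; this page does not list the underlying rates, and the function name is an assumption made for illustration.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical success rates (in percent), NOT taken from this page.
baseline_rate = 9.0
improved_rate = 42.0

print(f"{relative_improvement(baseline_rate, improved_rate):.0f}%")
```

A relative improvement of 367% means the improved rate is about 4.67 times the baseline, which is a different statement from an absolute gain of 367 percentage points.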

Abstract

Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal large language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain…
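
The simulation-in-the-loop refinement mentioned in the abstract can be pictured as a generate, execute, and retry loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `refine_until_success`, `ExecutionResult`, and the toy generator and simulator are all hypothetical names, and the stand-ins replace a real MLLM and a real simulator so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExecutionResult:
    success: bool
    feedback: str  # e.g. an error log or a description of the failure


def refine_until_success(
    task: str,
    generate_code: Callable[[str, Optional[str]], str],
    run_in_simulation: Callable[[str], ExecutionResult],
    max_rounds: int = 5,
) -> Optional[str]:
    """Return the first candidate program that succeeds in simulation, else None."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # Ask the (hypothetical) code generator for a program,
        # passing along feedback from the previous failed attempt.
        code = generate_code(task, feedback)
        result = run_in_simulation(code)
        if result.success:
            return code
        feedback = result.feedback
    return None


# Toy stand-ins so the sketch runs without a model or a simulator.
def toy_generator(task: str, feedback: Optional[str]) -> str:
    # The first attempt omits the grasp; the retry reacts to the feedback.
    if feedback is None:
        return "move_to(target)"
    return "move_to(target); close_gripper()"


def toy_simulator(code: str) -> ExecutionResult:
    if "close_gripper" in code:
        return ExecutionResult(True, "")
    return ExecutionResult(False, "object was never grasped")


program = refine_until_success("pick up the cup", toy_generator, toy_simulator)
print(program)
```

Capping the number of rounds keeps a task that never succeeds from looping forever; in that case the sketch returns `None` so the caller can discard the task.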

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 8·Confidence 4

Strengths

- The paper is generally well-written, with a clearly written motivation section, logical flow, and legible figures
- A comprehensive simulation framework with large-scale assets and the ability to generate tasks and datasets automatically with LLMs
- Comprehensive experiments, first in simulation to benchmark numerous policy learning algorithms, and in the real world to show the utility of the simulation data for learning real world tasks

Weaknesses

- Details on the expert code generation pipeline are very sparse, and while some details are provided in the appendix, it would be helpful to provide additional context on all the pieces outlined in Figure 3, in the main text (assumptions, inputs, outputs, etc).
- It would be helpful to break down the domain randomization factors further. Which factors contribute most to learning robust behaviors, among scene clutter, lighting, tabletop heights, textures, etc?
- The real robot experiments use a

Reviewer 02·Rating 4·Confidence 4

Strengths

1. RoboTwin 2.0 introduces a closed-loop MLLM-based code-generation system that can automatically create, execute, and refine robot-control scripts until they succeed. This minimizes human supervision while maintaining data quality, allowing the collection of 100k+ expert trajectories across diverse dual-arm tasks. It demonstrates how large-scale synthetic data for manipulation can be generated programmatically and verified via multimodal feedback.
2. The framework emphasizes five-axis domain ra

Weaknesses

1. The paper claims existing benchmarks have weak domain randomization. Is there empirical evidence comparing OOD manipulation benchmarks, for example The Colosseum, GEMBench, or others?
2. The sim-to-real evaluation covers four bimanual tasks on a single platform (COBOT-Magic) with one policy backbone (RDT), which limits how broadly the gains generalize to other tasks, robots, and policies. Are the real-world results statistically significant, and if so, what tests and effect sizes are reported?

Reviewer 03·Rating 6·Confidence 4

Strengths

1. Dedicated effort into curating tasks with diverse object assets, domain randomization schemes, and support for multiple robot embodiments. These are all important aspects in studying bimanual robot manipulation.
2. Good presentation quality. The authors provided extensive details and visualizations for the task design and simulation implementation details. It's difficult to organize the content when it comes to writing a benchmark paper, and the authors have presented information in a clear way

Weaknesses

1. Tasks are limited to table-top manipulation of mainly rigid objects. Compared to some prior works (e.g. RoboCasa), the lack of scene-level assets such as kitchen countertops or cabinets limits the diversity of possible tasks. This is perhaps beyond the scope of the current work, but a larger scene setting would make the benchmark much more useful in studying more tasks and diverse robot behaviors.
2. Lack of qualitative results on the policy learning performance. It would have been much cle

Code & Models

Models

🤗 iMihayo/custom_robotwin (model)

Datasets
