RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang

TL;DR
RoboTwin 2.0 introduces a scalable data generation framework with domain randomization for robust bimanual robotic manipulation, significantly improving sim-to-real transfer and policy robustness across diverse tasks and environments.
Contribution
The paper presents RoboTwin 2.0, a novel scalable framework for generating diverse, realistic data and benchmarks for dual-arm manipulation with strong domain randomization techniques.
Findings
10.9% increase in code generation success rate
367% improvement in policy performance with minimal real data
Effective robustness to environmental variations
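A percentage gain above 100% is easy to misread. As a small arithmetic sketch, assuming the 367% figure is a relative improvement over a baseline (the summary above does not spell this out), the conversion to a multiplier looks as follows. The function names are illustrative and not taken from the paper.

```python
def gain_to_multiplier(relative_gain_percent: float) -> float:
    """Convert a relative improvement (in percent) into a multiplier on the baseline."""
    return 1.0 + relative_gain_percent / 100.0


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


# Under the relative-improvement reading, 367% means about 4.67x the baseline,
# not 3.67x.
print(f"{gain_to_multiplier(367.0):.2f}x")
```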
Abstract
Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain…
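The simulation-in-the-loop refinement mentioned in the abstract can be pictured as a generate, execute, and retry loop. The sketch below is a minimal illustration under stated assumptions, not the authors' pipeline: `generate` stands in for an MLLM call, `simulate` for a simulator rollout, and the threshold, round limit, and toy stand-ins are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SimResult:
    success_rate: float  # fraction of simulated rollouts that completed the task
    feedback: str        # failure description passed back to the generator


def refine_until_success(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    simulate: Callable[[str], SimResult],
    threshold: float = 0.5,   # hypothetical acceptance threshold
    max_rounds: int = 5,      # hypothetical retry budget
) -> Tuple[Optional[str], int]:
    """Regenerate task code until simulation succeeds or the budget runs out."""
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        code = generate(task, feedback)
        result = simulate(code)
        if result.success_rate >= threshold:
            return code, round_idx
        feedback = result.feedback  # feed the failure back into the next attempt
    return None, max_rounds


# Toy stand-ins so the loop can be run without a model or a simulator.
def toy_generate(task: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "move_to(obj)"
    return "move_to(obj); close_gripper()"


def toy_simulate(code: str) -> SimResult:
    if "close_gripper" in code:
        return SimResult(1.0, "")
    return SimResult(0.0, "object was never grasped")


code, rounds = refine_until_success("pick up the cup", toy_generate, toy_simulate)
print(rounds, code)
```

Keeping the generator and simulator as injected callables makes the loop testable in isolation; a real system would replace the toy functions with model and simulator calls.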
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper is generally well-written, with a clearly written motivation section, logical flow, and legible figures
- A comprehensive simulation framework with large-scale assets and the ability to generate tasks and datasets automatically with LLMs
- Comprehensive experiments, first in simulation to benchmark numerous policy learning algorithms, and in the real world to show the utility of the simulation data for learning real world tasks
- Details on the expert code generation pipeline are very sparse. While some details are provided in the appendix, it would help to give additional context in the main text on every component outlined in Figure 3 (assumptions, inputs, outputs, etc.).
- It would be helpful to break down the domain randomization factors further. Which factors contribute most to learning robust behaviors, among scene clutter, lighting, tabletop heights, textures, etc.?
- The real robot experiments use a
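The per-factor breakdown requested in this review amounts to an ablation over randomization factors. The sketch below shows one way such a toggle could be structured, using the four factors the reviewer names. The field names, default values, and sampling ranges are hypothetical and do not come from the paper.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SceneConfig:
    num_distractors: int    # scene clutter
    light_intensity: float  # lighting (arbitrary units)
    table_height_m: float   # tabletop height in meters
    texture_id: int         # index into a texture set


# Hypothetical non-randomized ("clean") scene.
DEFAULTS = SceneConfig(num_distractors=0, light_intensity=1.0,
                       table_height_m=0.75, texture_id=0)


def sample_scene(rng: random.Random, enabled: FrozenSet[str]) -> SceneConfig:
    """Randomize only the factors in `enabled`; keep the rest at defaults."""
    return SceneConfig(
        num_distractors=rng.randint(0, 8) if "clutter" in enabled
        else DEFAULTS.num_distractors,
        light_intensity=rng.uniform(0.3, 2.0) if "lighting" in enabled
        else DEFAULTS.light_intensity,
        table_height_m=rng.uniform(0.70, 0.80) if "table_height" in enabled
        else DEFAULTS.table_height_m,
        texture_id=rng.randrange(100) if "texture" in enabled
        else DEFAULTS.texture_id,
    )


rng = random.Random(0)
# Ablation-style usage: randomize one factor at a time.
for factor in ("clutter", "lighting", "table_height", "texture"):
    print(factor, sample_scene(rng, frozenset({factor})))
```

Training one policy per enabled subset and comparing robustness would give the kind of per-factor attribution the reviewer asks about.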
1. RoboTwin 2.0 introduces a closed-loop MLLM-based code-generation system that can automatically create, execute, and refine robot-control scripts until they succeed. This minimizes human supervision while maintaining data quality, allowing the collection of 100k+ expert trajectories across diverse dual-arm tasks. It demonstrates how large-scale synthetic data for manipulation can be generated programmatically and verified via multimodal feedback.
2. The framework emphasizes five-axis domain ra
1. The paper claims existing benchmarks have weak domain randomization. Is there empirical evidence from a comparison against OOD manipulation benchmarks such as The Colosseum, GEMBench, or others?
2. The sim-to-real evaluation covers four bimanual tasks on a single platform (COBOT-Magic) with one policy backbone (RDT), which limits how broadly the gains generalize to other tasks, robots, and policies. Are the real-world results statistically significant, and if so, what tests and effect sizes are reported?
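For success-rate comparisons over a small number of real-robot trials, one common way to address the significance question is Fisher's exact test, with the difference in success rates as a simple effect size. The sketch below is a self-contained illustration, not an analysis from the paper, and the trial counts in the usage example are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p-value for two success counts out of n_a and n_b trials."""
    total = n_a + n_b
    total_succ = succ_a + succ_b
    denom = comb(total, n_a)

    def prob(x: int) -> float:
        # Hypergeometric probability that group A holds x of the total successes.
        return comb(total_succ, x) * comb(total - total_succ, n_a - x) / denom

    lo = max(0, n_a - (total - total_succ))
    hi = min(n_a, total_succ)
    p_obs = prob(succ_a)
    # Sum over all tables at least as unlikely as the observed one.
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


def risk_difference(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Effect size: difference in success rates between the two conditions."""
    return succ_a / n_a - succ_b / n_b


# Hypothetical counts: 9/10 successes with synthetic data added, 3/10 without.
p = fisher_exact_two_sided(9, 10, 3, 10)
print(f"p = {p:.4f}, risk difference = {risk_difference(9, 10, 3, 10):.2f}")
```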
1. Dedicated effort in curating tasks with diverse object assets, domain randomization schemes, and support for multiple robot embodiments. These are all important aspects in studying bimanual robot manipulation.
2. Good presentation quality. The authors provided extensive details and visualizations for the task design and simulation implementation details. It's difficult to organize the content when it comes to writing a benchmark paper, and the authors have presented information in a clear way
1. Tasks are limited to table-top manipulation of mainly rigid objects. Compared to some prior works (e.g. RoboCasa), the lack of scene-level assets such as kitchen countertops or cabinets limits the diversity of possible tasks. This is perhaps beyond the scope of the current work, but a larger scene setting would make the benchmark much more useful in studying more tasks and diverse robot behaviors.
2. Lack of qualitative results on the policy learning performance. It would have been much cle
