Dexora: Open-source VLA for High-DoF Bimanual Dexterity

Zongzheng Zhang; Jingrui Pang; Zhuo Yang; Kun Li; Minwen Liao; Saining Zhang; Guoxuan Chi; Jinbang Guo; Huan-ang Gao; Modi Shi; Dongyun Ge; Yao Mu; Jiayuan Gu; Rui Chen; Hao Dong; Huazhe Xu; Li Yi; Yixin Zhu; Hang Zhao; Pengwei Wang; Shanghang Zhang; Guocai Yao; Jianyu Chen; Hongyang Li; Hao Zhao

arXiv:2605.18722·cs.RO·May 19, 2026


TL;DR

Dexora is an open-source vision-language-action (VLA) system for high-DoF dual-arm, dual-hand manipulation. It combines a hybrid teleoperation pipeline, large datasets, and a data-quality-aware training method.

Contribution

The paper introduces Dexora, presented as the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation, together with a hybrid teleoperation pipeline, large datasets, and a data-quality-aware training method.

Findings

01 Dexora outperforms baselines on dexterous benchmarks, reaching 66.7% success.

02 Dexora achieves 90% success on basic manipulation tasks.

03 Dexora demonstrates robust generalization across different embodiments.

Abstract

Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic…
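The decoupling described in the abstract, with arm joints coming from one capture device and finger joints from another, both driving a physical robot and a simulated twin, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the joint counts `ARM_DOF` and `HAND_DOF`, the `BimanualCommand` layout, and the `merge_streams` and `broadcast` helpers are hypothetical and are not taken from the Dexora code or paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical joint counts, chosen only for illustration.
# The source does not state the per-arm or per-hand DoF.
ARM_DOF = 7
HAND_DOF = 6


@dataclass
class BimanualCommand:
    """One time step of joint targets for two arms and two hands."""
    left_arm: List[float]
    left_hand: List[float]
    right_arm: List[float]
    right_hand: List[float]

    def flatten(self) -> List[float]:
        # Assumed ordering: left arm, left hand, right arm, right hand.
        return self.left_arm + self.left_hand + self.right_arm + self.right_hand


def _checked(values: Sequence[float], expected: int, name: str) -> List[float]:
    """Copy a joint vector after validating its length."""
    if len(values) != expected:
        raise ValueError(f"{name}: expected {expected} values, got {len(values)}")
    return [float(v) for v in values]


def merge_streams(
    exo_left: Sequence[float],
    exo_right: Sequence[float],
    tracker_left: Sequence[float],
    tracker_right: Sequence[float],
) -> BimanualCommand:
    """Combine arm angles (exoskeleton stream) with finger angles (hand-tracking stream)."""
    return BimanualCommand(
        left_arm=_checked(exo_left, ARM_DOF, "exo_left"),
        left_hand=_checked(tracker_left, HAND_DOF, "tracker_left"),
        right_arm=_checked(exo_right, ARM_DOF, "exo_right"),
        right_hand=_checked(tracker_right, HAND_DOF, "tracker_right"),
    )


def broadcast(cmd: BimanualCommand, sinks: Sequence[Callable[[List[float]], None]]) -> None:
    """Send the same flattened command to every sink, e.g. a robot driver and a simulator."""
    for sink in sinks:
        sink(cmd.flatten())


if __name__ == "__main__":
    robot_log: List[List[float]] = []
    sim_log: List[List[float]] = []
    cmd = merge_streams([0.0] * ARM_DOF, [0.1] * ARM_DOF, [0.2] * HAND_DOF, [0.3] * HAND_DOF)
    broadcast(cmd, [robot_log.append, sim_log.append])
    print(len(robot_log[0]), robot_log[0] == sim_log[0])
```

Keeping the two input streams separate until a single merge step means either capture device could be swapped without touching the other, and sending one flattened vector to every sink keeps the real and simulated platforms on identical targets. A real pipeline would also need retargeting, time synchronization, and safety limits, none of which are modeled here.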
