TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models
Hokyun Im, Euijin Jeong, Andrey Kolobov, Jianlong Fu, Youngwoon Lee

TL;DR
TwinVLA introduces a modular approach that composes two pretrained single-arm vision-language-action models to efficiently perform bimanual manipulation tasks, reducing data needs and improving performance.
Contribution
It presents a novel modular framework that combines pretrained single-arm models for effective bimanual manipulation without extensive bimanual training data.
Findings
Outperforms monolithic models on bimanual tasks in real and simulated environments.
Requires no additional bimanual pretraining data.
Narrows the gap to state-of-the-art models that rely on extensive proprietary data.
Abstract
Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the…
Peer Reviews
Decision: ICLR 2026 Poster
The motivation is sound, and the results demonstrate that it is a more efficient training approach. The experimental results and supplementary materials provide additional details, and the experiments on weight changes in particular highlight the advantages of the proposed method.
1. The implementation details are not clearly described. The approach to handling bimanual manipulation appears somewhat similar to previous works, or at least it’s not evident how it fundamentally differs from other methods. More implementation details are needed from the authors to ensure an accurate evaluation.
[1] Anybimanual: Transferring Single-Arm Policy for General Bimanual Manipulation
[2] InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bim
- The evaluation is comprehensive and convincing, covering both simulation and real-world settings. The experiments span two simulation benchmarks and physical robot trials, with tasks designed to test both coordinated bimanual manipulation and asymmetric master-assistant roles.
- The method draws inspiration from neuroscience principles, providing an interesting conceptual link between biological motor coordination and modular robotic control.
- The approach significantly reduces VRAM and com
- The core architecture and data-flow formulation of TwinVLA, arguably the central contribution, would benefit from a more formal and detailed presentation (i.e., with equations and notation) in the main paper. Key notation and explanations are currently placed in the appendix, while the preliminaries and single-arm VLA review receive proportionally more emphasis. Streamlining those background sections could make room for a clearer, more rigorous exposition of the proposed twin-policy mechanism
Modular twin design. Proposes TwinVLA, a clear, practical architecture that couples two single-arm pretrained VLAs via causal joint attention, enabling coordinated bimanual control while preserving per-arm specialization.
Efficiency mechanisms. Integrates MoE routing (and attention re-weighting on shared tokens) to avoid duplicate computation on shared inputs and to stabilize adaptation, yielding a favorable compute/memory profile.
Data efficiency. Achieves strong performance with only a small
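As an illustration of the causal joint attention mentioned in this review, the sketch below is a toy, pure-Python example and not the authors' implementation. It assumes (hypothetically) that each arm keeps its own query/key/value projections, that the two token streams are concatenated left-then-right, and that a standard lower-triangular mask is applied over the concatenated sequence. The names joint_causal_attention, left_proj, and right_proj are illustrative only. Under this assumed ordering the right arm's tokens attend to the left arm's but not the reverse; the paper's actual mask and token layout may differ.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, d):
    """Random d x d matrix standing in for pretrained weights."""
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]


def joint_causal_attention(left_tokens, right_tokens, left_proj, right_proj):
    """Single-head attention over the concatenation [left; right].

    Assumption: each arm projects its own tokens with its own weights
    (dicts with 'q', 'k', 'v' matrices), and attention is then computed
    jointly with a lower-triangular (causal) mask.
    """
    q, k, v = [], [], []
    for tokens, proj in ((left_tokens, left_proj), (right_tokens, right_proj)):
        for x in tokens:
            q.append(matvec(proj["q"], x))
            k.append(matvec(proj["k"], x))
            v.append(matvec(proj["v"], x))

    d = len(q[0])
    outputs = []
    for i in range(len(q)):
        # Causal mask: position i only sees positions 0..i of the joint sequence.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(d)]
        )

    n_left = len(left_tokens)
    return outputs[:n_left], outputs[n_left:]


if __name__ == "__main__":
    rng = random.Random(0)
    d = 4
    left_proj = {name: random_matrix(rng, d) for name in "qkv"}
    right_proj = {name: random_matrix(rng, d) for name in "qkv"}
    left = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
    right = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(2)]
    out_left, out_right = joint_causal_attention(left, right, left_proj, right_proj)
    print(len(out_left), len(out_right))
```

In this toy setup, perturbing the left arm's tokens changes the right arm's outputs, which is the cross-arm coupling the review describes, while each arm's projections stay separate.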
Architectural advantage not causally established: The paper argues that the "twin structure" is superior to a "monolithic" one by comparing TwinVLA (1.3B) to RDT-1B (1.2B). However, this comparison is confounded by major differences in their pretraining data and recipes. Without a size-matched monolithic VLA trained on exactly the same 0.5M single-arm pretraining data and the same 50 bimanual demos (with equal token/compute), the observed gains cannot be causally attributed to the twin. Biman
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
