TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models

Hokyun Im; Euijin Jeong; Andrey Kolobov; Jianlong Fu; Youngwoon Lee

arXiv:2511.05275·cs.RO·February 24, 2026


Open Access · 1 Model · 3 Reviews

TL;DR

TwinVLA introduces a modular approach that composes two pretrained single-arm vision-language-action models to efficiently perform bimanual manipulation tasks, reducing data needs and improving performance.

Contribution

The paper presents a modular framework that composes pretrained single-arm models into a bimanual policy without requiring extensive bimanual training data.

Findings

01

Outperforms monolithic models on bimanual tasks in real and simulated environments.

02

Requires no additional bimanual pretraining data.

03

Narrows the gap to state-of-the-art models trained on extensive proprietary data.

Abstract

Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The motivation is sound, and the results demonstrate that it is a more efficient training approach. The experimental results and supplementary materials provide additional details, and the experiments on weight changes in particular highlight the advantages of the proposed method.

Weaknesses

1. The implementation details are not clearly described. The approach to handling bimanual manipulation appears somewhat similar to previous works, or at least it’s not evident how it fundamentally differs from other methods. More implementation details are needed from the authors to ensure an accurate evaluation.

[1] Anybimanual: Transferring Single-Arm Policy for General Bimanual Manipulation
[2] InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bim

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The evaluation is comprehensive and convincing, covering both simulation and real-world settings. The experiments span two simulation benchmarks and physical robot trials, with tasks designed to test both coordinated bimanual manipulation and asymmetric master-assistant roles.
- The method draws inspiration from neuroscience principles, providing an interesting conceptual link between biological motor coordination and modular robotic control.
- The approach significantly reduces VRAM and com

Weaknesses

- The core architecture and data-flow formulation of TwinVLA, arguably the central contribution, would benefit from a more formal and detailed presentation (i.e., with equations and notation) in the main paper. Key notation and explanations are currently placed in the appendix, while the preliminaries and single-arm VLA review receive proportionally more emphasis. Streamlining those background sections could make room for a clearer, more rigorous exposition of the proposed twin-policy mechanism

Reviewer 03 · Rating 4 · Confidence 4

Strengths

Modular twin design. Proposes TwinVLA, a clear, practical architecture that couples two single-arm pretrained VLAs via causal joint attention, enabling coordinated bimanual control while preserving per-arm specialization.

Efficiency mechanisms. Integrates MoE routing (and attention re-weighting on shared tokens) to avoid duplicate computation on shared inputs and to stabilize adaptation, yielding a favorable compute/memory profile.

Data efficiency. Achieves strong performance with only a small
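
The "causal joint attention" coupling mentioned in the strengths above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `joint_causal_attention`, the single-head layout, and the plain lower-triangular mask over the concatenated left-arm and right-arm tokens are all assumptions made for illustration, and the paper's actual mask and token ordering may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; masked entries (-inf) receive zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_causal_attention(q_left, k_left, v_left, q_right, k_right, v_right):
    """Single-head attention over the concatenated token streams of two arms.

    Hypothetical illustration: each arm contributes its own queries, keys, and
    values (shape (T, d)), and attention runs over the concatenation so tokens
    from one arm can read tokens from the other.
    """
    q = np.concatenate([q_left, q_right], axis=0)
    k = np.concatenate([k_left, k_right], axis=0)
    v = np.concatenate([v_left, v_right], axis=0)
    n, d = q.shape

    # Scaled dot-product scores between every pair of tokens in the joint sequence.
    scores = q @ k.T / np.sqrt(d)

    # Assumed mask: lower-triangular, so token i sees tokens 0..i of the joint sequence.
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)

    weights = softmax(scores, axis=-1)
    out = weights @ v

    # Split the joint output back into per-arm streams.
    t_left = q_left.shape[0]
    return out[:t_left], out[t_left:], weights
```

Under this assumed ordering, right-arm tokens can attend to all left-arm tokens, while left-arm tokens only see earlier left-arm tokens; a different ordering or mask would change which arm conditions on which.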

Weaknesses

Architectural advantage not causally established. The paper argues that the "twin structure" is superior to a "monolithic" one by comparing TwinVLA (1.3B) to RDT-1B (1.2B). However, this comparison is confounded by major differences in their pre-training data and recipes. Without a size-matched monolithic VLA trained on exactly the same 0.5M single-arm pretraining data and the same 50 bimanual demos (with equal token/compute), the observed gains cannot be causally attributed to the twin. Biman

Code & Models

Models

🤗 jellyho/TwinVLA
model · 191 dl · ♡ 1

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning