TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

Bin Yu; Shijie Lian; Xiaopeng Lin; Yuliang Wei; Zhaolong Shen; Changti Wu; Yuzhuo Miao; Xinming Wang; Bailing Wang; Cong Huang; Kai Chen

arXiv:2601.14133·cs.RO·February 2, 2026

TL;DR

TwinBrainVLA introduces a dual-path vision-language-action model that preserves pre-trained knowledge while fine-tuning for robotic tasks, significantly improving manipulation performance.

Contribution

The paper proposes an Asymmetric Mixture-of-Transformers (AsyMoT) architecture with two VLM pathways, one frozen and one trainable, so that general knowledge is retained during robotic fine-tuning.

Findings

01 Achieves superior performance on SimplerEnv and RoboCasa benchmarks.

02 Effectively mitigates catastrophic forgetting in robotic fine-tuning.

03 Enhances complex manipulation task success rates.

Abstract

The fundamental premise of Vision-Language-Action (VLA) models is to harness the extensive general capabilities of pre-trained Vision-Language Models (VLMs) for generalized embodied intelligence. However, standard robotic fine-tuning inevitably disrupts the pre-trained feature space, leading to "catastrophic forgetting" that compromises the general visual understanding we aim to leverage. To effectively utilize the uncorrupted general capabilities of VLMs for robotic tasks, we propose TwinBrainVLA, which coordinates two isomorphic VLM pathways: a frozen generalist (also called "Left Brain") and a trainable specialist (also called "Right Brain"). Our architecture utilizes an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism, enabling the Right Brain to dynamically query and fuse intact semantic knowledge from the Left Brain with proprioceptive states. This fused representation…
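The abstract does not spell out how AsyMoT is wired, so the following is only a rough sketch of one way a one-directional attention step between a frozen and a trainable pathway could look. It assumes the trainable ("right") tokens attend over the concatenation of left and right tokens, while the frozen ("left") tokens attend only to themselves. The function names (`attend`, `asymmetric_layer`), the identity projections, and the omission of proprioceptive inputs are all illustrative assumptions, not details taken from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on plain Python lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def asymmetric_layer(left_hidden, right_hidden):
    """One hypothetical asymmetric attention step (assumed design, not the paper's).

    Left (frozen) tokens attend only to left tokens.
    Right (trainable) tokens attend to left and right tokens jointly.
    Learned projections are replaced by the identity to keep the sketch short.
    """
    left_out = attend(left_hidden, left_hidden, left_hidden)
    joint = left_hidden + right_hidden  # concatenate along the token axis
    right_out = attend(right_hidden, joint, joint)
    return left_out, right_out


random.seed(0)
left = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
right = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]

left_out, right_out = asymmetric_layer(left, right)

# Perturb the right pathway and recompute: the left output should be unaffected.
right_perturbed = [[x + 1.0 for x in tok] for tok in right]
left_out_2, right_out_2 = asymmetric_layer(left, right_perturbed)
print(left_out == left_out_2)
```

In this sketch the asymmetry is the point: information flows from the left pathway into the right one but never back, so the left pathway's representations stay identical regardless of what the right pathway does. That mirrors the abstract's description of a frozen generalist whose knowledge remains intact.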

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Domain Adaptation and Few-Shot Learning