MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning

Wenhui Huang; Changhe Chen; Han Qi; Chen Lv; Yilun Du; Heng Yang

arXiv:2510.18337·cs.RO·October 24, 2025

Open Access · 3 Reviews

TL;DR

MoTVLA is a unified vision-language-action model that combines fast and slow reasoning to improve robot manipulation efficiency and language steerability, leveraging pre-trained models and domain-specific transformers.

Contribution

It introduces a mixture-of-transformers framework that integrates fast domain-specific reasoning with generalist perception, enhancing robot policy learning and language steerability.

Findings

01

Outperforms existing models in manipulation tasks

02

Improves language steerability in robotic policies

03

Demonstrates efficiency in real-world robot experiments

Abstract

Integrating visual-language instructions into visuomotor policies is gaining momentum in robot learning for enhancing open-world generalization. Despite promising advances, existing approaches face two challenges: limited language steerability when no generated reasoning is used as a condition, or significant inference latency when reasoning is incorporated. In this work, we introduce MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA) model that integrates fast-slow unified reasoning with behavior policy learning. MoTVLA preserves the general intelligence of pre-trained VLMs (serving as the generalist) for tasks such as perception, scene understanding, and semantic planning, while incorporating a domain expert, a second transformer that shares knowledge with the pretrained VLM, to generate domain-specific fast reasoning (e.g., robot motion decomposition), thereby…
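As a rough illustration of the mixture-of-transformers idea in the abstract, the sketch below projects each token with the weights of the expert that owns it, then computes attention jointly over all tokens so that the two experts can share context. It is a toy under stated assumptions, not the authors' implementation: the names `Expert` and `mot_attention`, the dimensions, the random weights, and the omission of masking, multi-head structure, and feed-forward layers are all hypothetical choices made for brevity.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    """Random square-ish weight matrix (illustrative initialisation only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Expert:
    """Hypothetical expert: owns its own query/key/value projections."""

    def __init__(self, dim, rng):
        self.Wq = rand_matrix(dim, dim, rng)
        self.Wk = rand_matrix(dim, dim, rng)
        self.Wv = rand_matrix(dim, dim, rng)


def mot_attention(tokens, expert_ids, experts):
    """Expert-specific projections followed by one shared attention pass.

    tokens:     list of equal-length float vectors
    expert_ids: expert name for each token (assumed routing, fixed by modality)
    experts:    dict mapping expert name -> Expert
    """
    dim = len(tokens[0])
    # Each token is projected by the expert that owns it.
    q = [matvec(experts[e].Wq, t) for t, e in zip(tokens, expert_ids)]
    k = [matvec(experts[e].Wk, t) for t, e in zip(tokens, expert_ids)]
    v = [matvec(experts[e].Wv, t) for t, e in zip(tokens, expert_ids)]
    # Attention is global: every query attends to keys from both experts.
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    experts = {"generalist": Expert(4, rng), "domain": Expert(4, rng)}
    tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
    ids = ["generalist", "generalist", "domain"]
    mixed = mot_attention(tokens, ids, experts)
    print(len(mixed), len(mixed[0]))
```

Because the attention step is shared, perturbing a token owned by the generalist changes the output at a token owned by the domain expert. That coupling is the sense in which the two transformers can exchange information in this toy.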

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. This paper proposes a novel MoT-based unified fast-slow reasoning framework, addressing the critical trade-off between language steerability and inference latency in existing VLA models. The "generalist-domain expert" design with knowledge sharing is conceptually sound, filling a gap in current VLA research.
2. To ensure both general intelligence retention and policy efficiency, the authors adopt a "decomposition-composition-decomposition" pipeline for reasoning and a two-stage training rec

Weaknesses

1. The real-world dataset used for training the action expert is excessively small and relies heavily on manual annotation, which poses significant challenges for data scaling. This not only increases the risk of overfitting but also restricts the model to a narrow range of action skills (e.g., simple pick-and-place, stacking, and pulling tasks). Consequently, the model’s reliability and scalability in diverse real-world scenarios are compromised.
2. The results of reasoning tasks (Section 4.2)

Reviewer 02 · Rating 2 · Confidence 5

Strengths

S1. Using bidirectional token-wise reasoning can effectively accelerate the inference process.
S2. This paper conducts benchmarks on both general-domain and robotic-domain reasoning to evaluate their respective performances.

Weaknesses

W1. **How are slow reasoning and fast reasoning defined?** I don’t find it convincing that general-domain reasoning is inherently slow while robotic-domain reasoning is fast. Moreover, reasoning in the robotic domain is more crucial for manipulation. Wouldn’t adopting token-wise prediction potentially lead to insufficient reasoning capability?
W2. **The paper lacks both motivation and ablation studies.** It does not verify why and how general and robotic-domain reasoning improve manipulation pe

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Originality: The unified fast–slow reasoning architecture via a Mixture-of-Transformers is a creative synthesis that removes a key limitation in prior VLA systems—either poor steerability without explicit reasoning or high latency with autoregressive CoT—by sharing global attention between a pretrained generalist and a token-wise domain expert. The decomposition–composition–decomposition design and the use of fast motion decomposition to condition diffusion policies is a novel and pragmatic fo

Weaknesses

- Switching between the slow and fast models is not yet autonomous. The generalist is used only for optional high-level reasoning or dialogue at the outset. During execution, it can be activated only by the operator.
- Ablations are conducted on a simple cube-stacking task, where explicit reasoning appears less necessary.

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robot Manipulation and Learning