MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning
Wenhui Huang, Changhe Chen, Han Qi, Chen Lv, Yilun Du, Heng Yang

TL;DR
MoTVLA is a unified vision-language-action model that combines fast and slow reasoning to improve robot manipulation efficiency and language steerability, leveraging pre-trained models and domain-specific transformers.
Contribution
It introduces a mixture-of-transformers framework that integrates fast domain-specific reasoning with generalist perception, enhancing robot policy learning and language steerability.
Findings
Outperforms existing models in manipulation tasks
Improves language steerability in robotic policies
Demonstrates efficiency in real-world robot experiments
Abstract
Integrating visual-language instructions into visuomotor policies is gaining momentum in robot learning for enhancing open-world generalization. Despite promising advances, existing approaches face two challenges: limited language steerability when no generated reasoning is used as a condition, or significant inference latency when reasoning is incorporated. In this work, we introduce MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA) model that integrates fast-slow unified reasoning with behavior policy learning. MoTVLA preserves the general intelligence of pre-trained VLMs (serving as the generalist) for tasks such as perception, scene understanding, and semantic planning, while incorporating a domain expert, a second transformer that shares knowledge with the pretrained VLM, to generate domain-specific fast reasoning (e.g., robot motion decomposition), thereby…
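The knowledge sharing between the generalist and the domain expert described in the abstract can be illustrated with a toy mixture-of-transformers layer. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `shared_attention`, `mot_layer`, `owners`, and `expert_weights` are hypothetical, the query/key/value projections are replaced by the identity, and each expert's feed-forward block is reduced to a single linear map with a residual connection. The point it shows is that attention runs over all tokens jointly, while each token is then processed by the weights of the transformer that owns it.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def shared_attention(tokens):
    """Global self-attention over ALL tokens, whichever expert owns them.

    Assumption: identity Q/K/V projections, to keep the sketch short.
    """
    d = len(tokens[0])
    attended = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        )
    return attended

def mot_layer(tokens, owners, expert_weights):
    """Shared attention, then a per-token linear map chosen by the token's owner."""
    attended = shared_attention(tokens)
    return [
        [t + f for t, f in zip(tok, matvec(expert_weights[owner], tok))]
        for tok, owner in zip(attended, owners)
    ]

# Hypothetical toy input: two generalist tokens and one domain-expert token.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
owners = ["generalist", "generalist", "domain_expert"]
expert_weights = {
    "generalist": [[0.5, 0.0], [0.0, 0.5]],
    "domain_expert": [[0.0, 1.0], [1.0, 0.0]],
}
out = mot_layer(tokens, owners, expert_weights)
```

Because attention is computed over the full sequence, changing a generalist token changes the domain expert's output as well, which is one simple reading of "shares knowledge with the pretrained VLM".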
Peer Reviews
Decision · Submitted to ICLR 2026
1. This paper proposes a novel MoT-based unified fast-slow reasoning framework, addressing the critical trade-off between language steerability and inference latency in existing VLA models. The "generalist-domain expert" design with knowledge sharing is conceptually sound, filling a gap in current VLA research. 2. To ensure both general intelligence retention and policy efficiency, the authors adopt a "decomposition-composition-decomposition" pipeline for reasoning and a two-stage training rec
1. The real-world dataset used for training the action expert is excessively small and relies heavily on manual annotation, which poses significant challenges for data scaling. This not only increases the risk of overfitting but also restricts the model to a narrow range of action skills (e.g., simple pick-and-place, stacking, and pulling tasks). Consequently, the model’s reliability and scalability in diverse real-world scenarios are compromised. 2. The results of reasoning tasks (Section 4.2)
S1. Using bidirectional token-wise reasoning can effectively accelerate the inference process. S2. This paper conducts benchmarks on both general-domain and robotic-domain reasoning to evaluate their respective performances.
W1. **How are slow reasoning and fast reasoning defined?** I don’t find it convincing that general-domain reasoning is inherently slow while robotic-domain reasoning is fast. Moreover, reasoning in the robotic domain is more crucial for manipulation. Wouldn’t adopting token-wise prediction potentially lead to insufficient reasoning capability? W2. **The paper lacks both motivation and ablation studies.** It does not verify why and how general and robotic-domain reasoning improve manipulation pe
- Originality: The unified fast–slow reasoning architecture via a Mixture-of-Transformers is a creative synthesis that removes a key limitation in prior VLA systems—either poor steerability without explicit reasoning or high latency with autoregressive CoT—by sharing global attention between a pretrained generalist and a token-wise domain expert. The decomposition–composition–decomposition design and the use of fast motion decomposition to condition diffusion policies is a novel and pragmatic fo
- Switching between the slow and fast models is not yet autonomous. The generalist is used only for optional high-level reasoning or dialogue at the outset. During execution, it can be activated only by the operator. - Ablations are conducted on a simple cube-stacking task, where explicit reasoning appears less necessary.
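The latency argument that runs through the reviews (autoregressive reasoning is slow, token-wise prediction is fast) comes down to how many forward passes are needed to produce a reasoning string. The sketch below is only an illustration of that counting argument: `ToyModel` is a hypothetical stand-in that returns placeholder tokens, and it is an assumption of this sketch, not a claim about the paper, that both decoders yield the same text. In a real model the parallel variant may reason less thoroughly, which is the concern raised in W1.

```python
class ToyModel:
    """Hypothetical stand-in for a transformer; it only counts forward passes."""

    def __init__(self):
        self.forward_passes = 0

    def forward(self, context, num_slots):
        # One forward pass that predicts `num_slots` output positions at once.
        self.forward_passes += 1
        return [f"tok{len(context) + i}" for i in range(num_slots)]

def autoregressive_decode(model, prompt, n):
    """Slow path: one forward pass per generated token."""
    context = list(prompt)
    for _ in range(n):
        context.append(model.forward(context, 1)[0])
    return context[len(prompt):]

def parallel_decode(model, prompt, n):
    """Fast path: all n positions are predicted in a single forward pass."""
    return model.forward(list(prompt), n)

slow, fast = ToyModel(), ToyModel()
slow_tokens = autoregressive_decode(slow, ["img", "instr"], 8)
fast_tokens = parallel_decode(fast, ["img", "instr"], 8)
print(slow.forward_passes, fast.forward_passes)  # 8 1
```

For an output of n tokens the autoregressive path costs n passes and the parallel path costs one, so the gap grows linearly with the length of the generated reasoning.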
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robot Manipulation and Learning
