UAM: A Dual-Stream Perspective on Forgetting in VLA Training
Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

TL;DR
This paper introduces UAM, a dual-stream vision-language-action model inspired by biological vision, which preserves multimodal competence during training and improves out-of-distribution generalization in manipulation tasks.
Contribution
Proposes the Unified Action Model (UAM) with a parallel dorsal pathway initialized from a pretrained generative model, enabling end-to-end training without parameter freezing or auxiliary data.
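To make the two-pathway idea concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation. The class name `DualStreamPolicy`, the layer sizes, and the fusion by concatenation are all hypothetical; the summary above only states that a parallel dorsal pathway exists, that it is initialized from a pretrained model, and that nothing is frozen during training.

```python
import random


def init_weights(rows, cols, rng):
    """Create a small random weight matrix as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Apply a dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class DualStreamPolicy:
    """Toy two-pathway policy. All names and sizes are hypothetical."""

    def __init__(self, obs_dim, sem_dim, ctrl_dim, act_dim,
                 pretrained_dorsal=None, seed=0):
        rng = random.Random(seed)
        # Ventral stream: stands in for the VLM's semantic encoder.
        self.ventral = init_weights(sem_dim, obs_dim, rng)
        # Dorsal stream: copied from pretrained weights when provided,
        # otherwise randomly initialized.
        if pretrained_dorsal is not None:
            self.dorsal = [row[:] for row in pretrained_dorsal]
        else:
            self.dorsal = init_weights(ctrl_dim, obs_dim, rng)
        # Action head reads both streams (concatenation is an assumption).
        self.action_head = init_weights(act_dim, sem_dim + ctrl_dim, rng)

    def forward(self, obs):
        semantic = linear(self.ventral, obs)   # language-grounded features
        control = linear(self.dorsal, obs)     # control-relevant features
        return linear(self.action_head, semantic + control)


# Stand-in for weights taken from a pretrained generative model.
pretrained = init_weights(4, 8, random.Random(1))
policy = DualStreamPolicy(obs_dim=8, sem_dim=6, ctrl_dim=4, act_dim=3,
                          pretrained_dorsal=pretrained)
action = policy.forward([0.5] * 8)
```

In this sketch no weights are marked as frozen, mirroring the claim that training is end-to-end; the point is only that semantic and control features come from separate parameter sets.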
Findings
UAM retains over 95% of the underlying VLM's multimodal capability.
UAM achieves the highest success rates on various out-of-distribution manipulation tasks.
Architectural separation alone can preserve semantic capabilities in VLAs.
Abstract
Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a…
