Decoupled Action Expert: Confining Task Knowledge to the Conditioning Pathway
Jian Zhou, Sihao Lin, Shuai Fu, Zerui Li, Gengze Zhou, Qi WU

TL;DR
This paper demonstrates that task-specific knowledge in vision-language-action models can be confined to the conditioning pathway, allowing for smaller, task-agnostic backbones that maintain performance across multiple tasks.
Contribution
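It introduces a decoupled training recipe: a general-purpose action head is pretrained separately on observation-free forward-kinematics data and then frozen, while only the conditioning pathway is trained for downstream tasks. The results indicate that large backbones are unnecessary for effective action generation.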
Findings
A frozen, pretrained action backbone, with only the conditioning pathway trained downstream, performs comparably to models whose backbone is also trained.
Pretraining signals have little impact on downstream performance.
A small MLP backbone can replace large U-Net models without loss of accuracy.
Abstract
Many recent Vision-Language-Action models employ diffusion or flow-matching backbones with hundreds of millions of parameters for action generation. However, unlike image synthesis where the output spans millions of diverse pixels, a manipulation policy generates only short sequences of low-dimensional, physically correlated action values, a far simpler target that should not demand such capacity. We confirm this intuition and show that task-specific knowledge in these policies can be fully confined to the conditioning pathway, leaving the action backbone task-agnostic. To establish this, we introduce a decoupled training recipe: a general-purpose action head is first pretrained on observation-free forward-kinematics data, then frozen while only the conditioning pathway is trained for downstream tasks. Using Diffusion Policy as a testbed, we show that on both MimicGen and LIBERO, a…
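The freeze-then-train-conditioning idea in the abstract can be illustrated with a toy scalar sketch. This is not the paper's model or code: there is no diffusion process here, the "action head" is a two-weight linear map with assumed values standing in for pretrained weights, and every name (`action_head`, `conditioning`, `train_conditioning`, `FROZEN_W`) is hypothetical. It only shows the mechanics of updating the conditioning parameters by gradient descent while the head's weights stay fixed.

```python
# Toy scalar stand-in for a decoupled recipe; all names and values are hypothetical.

def action_head(w, noisy_action, cond):
    """Frozen, task-agnostic head: combines a noisy action with a conditioning signal."""
    return w[0] * noisy_action + w[1] * cond

def conditioning(theta, obs):
    """Trainable, task-specific pathway: maps an observation to a conditioning signal."""
    return theta[0] * obs + theta[1]

def mse(w, theta, data):
    """Mean squared error of the predicted action over the dataset."""
    errs = [(action_head(w, n, conditioning(theta, o)) - t) ** 2 for o, n, t in data]
    return sum(errs) / len(errs)

def train_conditioning(w, theta, data, lr=0.05, epochs=200):
    """SGD on theta only; the head weights w are read but never updated."""
    theta = list(theta)
    for _ in range(epochs):
        for obs, noisy, target in data:
            err = action_head(w, noisy, conditioning(theta, obs)) - target
            grad_cond = 2.0 * err * w[1]      # d(err^2)/d(cond), passing through the frozen head
            theta[0] -= lr * grad_cond * obs  # chain rule into the conditioning weights
            theta[1] -= lr * grad_cond
    return theta

# Assumed "pretrained" head weights; a tuple, so they cannot be mutated in place.
FROZEN_W = (0.1, 0.5)

# Synthetic task: targets are produced by the same head under a task-specific
# conditioning rule (3 * obs - 1), so the task is exactly learnable via theta.
data = []
for i in range(21):
    obs = i / 10.0 - 1.0
    noisy = 0.5 if i % 2 == 0 else -0.5
    data.append((obs, noisy, action_head(FROZEN_W, noisy, 3.0 * obs - 1.0)))

theta_init = [0.0, 0.0]
loss_before = mse(FROZEN_W, theta_init, data)
theta_fit = train_conditioning(FROZEN_W, theta_init, data)
loss_after = mse(FROZEN_W, theta_fit, data)
```

In this toy setting all task-specific information ends up in `theta_fit`, while `FROZEN_W` is shared and untouched. A real implementation would typically freeze the head by excluding its parameters from the optimizer instead of hand-writing gradients.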
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
