Decoupled Action Expert: Confining Task Knowledge to the Conditioning Pathway

Jian Zhou; Sihao Lin; Shuai Fu; Zerui Li; Gengze Zhou; Qi WU

arXiv:2511.12101 · cs.RO · March 17, 2026


TL;DR

This paper demonstrates that task-specific knowledge in vision-language-action models can be confined to the conditioning pathway, allowing for smaller, task-agnostic backbones that maintain performance across multiple tasks.

Contribution

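It introduces a decoupled training recipe: the action head is first pretrained on observation-free forward-kinematics data, then frozen while only the conditioning pathway is trained for each downstream task. This shows that large, task-specific action backbones are unnecessary for effective action generation.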
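The two-stage recipe can be illustrated with a deliberately tiny toy. This is a hypothetical sketch, not the authors' code: a scalar "action head" stands in for the pretrained action backbone, a scalar "conditioning" weight stands in for the conditioning pathway, and the helper names (`sgd_step`, `pretrain_head`, `finetune_conditioning`) are invented for illustration. The point it shows is the freezing pattern: in stage 2, gradients are computed for both parameters, but only the conditioning parameter is updated.

```python
# Hypothetical toy of a decoupled two-stage recipe (not the paper's implementation):
# stage 1 pretrains a scalar action head; stage 2 freezes it and trains only a
# scalar conditioning weight on downstream demonstrations.

def sgd_step(params, grads, frozen, lr):
    """Apply one SGD update, skipping every parameter whose name is in `frozen`."""
    for name, g in grads.items():
        if name not in frozen:
            params[name] -= lr * g


def pretrain_head(params, pairs, lr=0.05, steps=200):
    """Stage 1: fit action = head * c on observation-free (c, action) pairs."""
    for _ in range(steps):
        g = sum(2 * (params["head"] * c - a) * c for c, a in pairs) / len(pairs)
        sgd_step(params, {"head": g}, frozen=set(), lr=lr)


def finetune_conditioning(params, demos, lr=0.05, steps=200):
    """Stage 2: action = head * (cond * obs). Gradients exist for both
    parameters, but the head is frozen, so only `cond` changes."""
    for _ in range(steps):
        g_head = g_cond = 0.0
        for obs, a in demos:
            err = params["head"] * params["cond"] * obs - a
            g_head += 2 * err * params["cond"] * obs / len(demos)
            g_cond += 2 * err * params["head"] * obs / len(demos)
        sgd_step(params, {"head": g_head, "cond": g_cond}, frozen={"head"}, lr=lr)


params = {"head": 0.0, "cond": 0.0}
pretrain_head(params, [(c, 2.0 * c) for c in (0.5, 1.0, 1.5)])
head_after_pretraining = params["head"]
finetune_conditioning(params, [(o, 6.0 * o) for o in (0.5, 1.0, 1.5)])
print(round(params["head"], 4), round(params["cond"], 4))  # → 2.0 3.0
```

The head learned in stage 1 (about 2.0) is left untouched in stage 2; the downstream task is absorbed entirely by the conditioning weight (about 3.0, so that 2.0 × 3.0 reproduces the target slope of 6.0). In a real policy the same effect is usually achieved by excluding the head's parameters from the optimizer or disabling their gradients.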

Findings

01

A frozen, pretrained action head, with only the conditioning pathway trained per task, performs comparably to fully trained models.

02

Pretraining signals have little impact on downstream performance.

03

A small MLP backbone can replace large U-Net models without loss of accuracy.
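A back-of-the-envelope parameter count helps explain why a small MLP can suffice when the output is a short, low-dimensional action chunk. The configuration below is purely hypothetical (16-step chunks of 7-dimensional actions, a 256-dimensional conditioning vector into which any diffusion timestep embedding is assumed to be folded, two hidden layers of width 512, and conditioning by simple concatenation); the paper's actual MLP sizes and conditioning mechanism are not given here.

```python
def mlp_param_count(layer_sizes):
    """Total weights plus biases of a fully connected MLP with the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical configuration, not taken from the paper.
horizon, action_dim = 16, 7   # one action chunk: 16 steps of 7-D actions
cond_dim = 256                # conditioning vector (timestep embedding assumed folded in)
chunk = horizon * action_dim  # 112 numbers to denoise per sample

# Input: noisy chunk concatenated with conditioning; output: predicted noise.
sizes = [chunk + cond_dim, 512, 512, chunk]
print(chunk, mlp_param_count(sizes))  # → 112 509040
```

Under these assumptions the denoiser has roughly 0.5M parameters, several hundred times fewer than the "hundreds of millions" the abstract attributes to typical diffusion or flow-matching action backbones.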

Abstract

Many recent Vision-Language-Action models employ diffusion or flow-matching backbones with hundreds of millions of parameters for action generation. However, unlike image synthesis where the output spans millions of diverse pixels, a manipulation policy generates only short sequences of low-dimensional, physically correlated action values, a far simpler target that should not demand such capacity. We confirm this intuition and show that task-specific knowledge in these policies can be fully confined to the conditioning pathway, leaving the action backbone task-agnostic. To establish this, we introduce a decoupled training recipe: a general-purpose action head is first pretrained on observation-free forward-kinematics data, then frozen while only the conditioning pathway is trained for downstream tasks. Using Diffusion Policy as a testbed, we show that on both MimicGen and LIBERO, a…
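The abstract does not describe how the observation-free forward-kinematics data are constructed. As one hypothetical illustration, a planar two-link arm can generate such data without any camera images: sample a smooth joint-space trajectory, then compute the end-effector position at every step with forward kinematics. The result is a short sequence of low-dimensional, physically correlated values, the kind of target the abstract describes. The arm model, link lengths, linear interpolation, and the helper names `planar_fk` and `make_fk_trajectory` are all assumptions, as is the question of which quantity would serve as conditioning and which as the action target.

```python
import math
import random


def planar_fk(q1, q2, l1=1.0, l2=1.0):
    """End-effector (x, y) of a planar two-link arm with link lengths l1, l2."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def make_fk_trajectory(rng, horizon=8):
    """Hypothetical observation-free sample: a linearly interpolated joint
    trajectory paired with the end-effector path it produces."""
    q_start = [rng.uniform(-math.pi, math.pi) for _ in range(2)]
    q_end = [rng.uniform(-math.pi, math.pi) for _ in range(2)]
    joints = []
    for t in range(horizon):
        s = t / (horizon - 1)  # interpolation fraction from 0.0 to 1.0
        joints.append(tuple(a + s * (b - a) for a, b in zip(q_start, q_end)))
    ee_path = [planar_fk(*q) for q in joints]
    return {"joints": joints, "ee_pos": ee_path}


rng = random.Random(0)
dataset = [make_fk_trajectory(rng) for _ in range(4)]
print(planar_fk(0.0, 0.0))  # → (2.0, 0.0)
```

Because such samples need only a kinematic model, they can be generated in any quantity and carry no task- or scene-specific information, which is consistent with the goal of keeping the pretrained action head task-agnostic.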


Taxonomy

Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications