Action with Visual Primitives

Weilong Guo; Yuchen Wang; Renping Zhou; Yunfeng Zhang; Rui Fang; Yue Meng; Wenda Xu; Yuan He; Gao Huang

arXiv:2605.22183 · cs.RO · May 22, 2026

TL;DR

AVP (Action with Visual Primitives) is a visual-primitive-centric approach to robotic manipulation: a pretrained vision-language model emits visual-primitive tokens that condition the action expert, which improves success rates and generalization.

Contribution

The paper proposes AVP, an end-to-end architecture in which visual-primitive tokens emitted by the VLM condition a flow-matching action expert, with the goal of improving learning efficiency and generalization.

Findings

01. AVP improves success rate by 27.61% over pi_0.5.

02. AVP outperforms recent methods in data efficiency and generalization.

03. AVP demonstrates effective object-level transfer in robot experiments.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements a visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector…
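
To make the idea of a conditioned flow-matching action expert concrete, here is a minimal toy sketch of flow-matching sampling. It is not the paper's implementation. It assumes the visual primitive can be summarized as a single 3D target point and replaces the learned action expert with a closed-form velocity field; `toy_velocity`, `sample_action`, and the `primitive` values are hypothetical names and numbers chosen only for illustration. Sampling starts from Gaussian noise and integrates the velocity field from t = 0 to t = 1 with Euler steps.

```python
import random


def toy_velocity(x, t, primitive):
    """Hypothetical stand-in for a learned action expert.

    For a straight-line path x_t = (1 - t) * x0 + t * x1 with a known
    endpoint x1 (here: the conditioning primitive), the velocity that
    carries x_t to x1 is (x1 - x_t) / (1 - t).
    """
    return [(p - xi) / (1.0 - t) for xi, p in zip(x, primitive)]


def sample_action(primitive, steps=10, seed=0):
    """Integrate the velocity field from noise (t=0) to an action (t=1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in primitive]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        v = toy_velocity(x, t, primitive)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler step
    return x


# Assumed conditioning signal: a 3D target position (illustrative values).
primitive = [0.30, -0.10, 0.25]
action = sample_action(primitive)
```

In this toy setting the integration ends at the conditioning target up to floating-point error, because the velocity field is exact. A learned action expert would instead approximate the velocity field from demonstrations and output full action vectors rather than the target itself.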
