Action with Visual Primitives
Weilong Guo, Yuchen Wang, Renping Zhou, Yunfeng Zhang, Rui Fang, Yue Meng, Wenda Xu, Yuan He, Gao Huang

TL;DR
AVP introduces a visual-primitive-centric approach for robotic manipulation, improving success rates and generalization by decomposing actions into visual primitives and leveraging pretrained vision-language models.
Contribution
The paper proposes AVP, an end-to-end architecture in which a pretrained VLM emits visual-primitive tokens that condition a flow-matching action expert, with the aim of improving learning efficiency and generalization in robotic manipulation.
Findings
AVP improves success rate by 27.61% over pi_0.5.
AVP outperforms recent methods in data efficiency and generalization.
AVP demonstrates effective object-level transfer in robot experiments.
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements a visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector…
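To make the conditioning idea concrete, the following is a minimal toy sketch of how a flow-matching sampler can turn noise into an action under a conditioning signal. It is not the paper's implementation. Every name here (`decode_primitive`, `toy_velocity`, `sample_action`) is hypothetical, the "visual-primitive tokens" are assumed to be plain 2D keypoints, and the learned velocity network is replaced by the closed-form velocity of a straight-line flow toward a single known target.

```python
import random


def decode_primitive(tokens):
    """Hypothetical stand-in for conditioning: map 2D keypoints to a target.

    Assumption: the 'visual-primitive tokens' are (x, y) keypoints and the
    target action is simply their mean. A real model would learn this mapping.
    """
    n = len(tokens)
    return [sum(p[0] for p in tokens) / n, sum(p[1] for p in tokens) / n]


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity field v(x, t | c).

    For a single known target, the straight-line flow from noise to the
    target has velocity (target - x) / (1 - t) for t < 1.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def sample_action(tokens, num_steps=10, seed=0):
    """Integrate the velocity field from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    target = decode_primitive(tokens)
    # Start from Gaussian noise, as is standard in flow-matching samplers.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(sample_action([(0.2, 0.4), (0.4, 0.8)]))
```

In this toy setting the sampled action lands on the mean of the keypoints regardless of the initial noise, which illustrates the division of labor the abstract describes: the conditioning signal determines where the action should go, and the flow-matching component only has to transport noise to that target.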
