PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models
Peizheng Guo, Jingyao Wang, Changwen Zheng, Wenwen Qiang

TL;DR
This paper introduces PAPO-VLA, a novel planning-aware optimization method for vision-language-action models that enhances the reliability of robotic task execution by explicitly identifying and emphasizing planning actions.
Contribution
It proposes a new approach to improve VLA policy training by explicitly identifying and weighting planning actions based on their importance for task success.
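The summary does not say how planning actions are identified or how their importance is measured, so the following is only a minimal sketch of the general idea of weighting some steps more heavily in a policy-gradient style loss. It is not the authors' method. The names `planning_weights` and `weighted_policy_loss`, the `boost` parameter, the binary planning flags, and the toy numbers are all assumptions made for illustration.

```python
from typing import List, Sequence


def planning_weights(is_planning: Sequence[bool], boost: float = 3.0) -> List[float]:
    """Hypothetical rule: steps flagged as planning get `boost`, all others get 1.0."""
    return [boost if flag else 1.0 for flag in is_planning]


def weighted_policy_loss(
    log_probs: Sequence[float],
    advantages: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted surrogate loss: -sum(w_t * A_t * log pi_t) / sum(w_t)."""
    if not (len(log_probs) == len(advantages) == len(weights)):
        raise ValueError("all inputs must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        w * a * lp for w, a, lp in zip(weights, advantages, log_probs)
    )
    return -weighted_sum / total_weight


# Toy four-step trajectory (made-up data); step 1 is flagged as a planning action
# and is also the step where the policy assigns the lowest log-probability.
is_planning = [False, True, False, False]
log_probs = [-0.5, -2.0, -0.5, -0.5]
advantages = [1.0, 1.0, 1.0, 1.0]

uniform_loss = weighted_policy_loss(log_probs, advantages, [1.0] * 4)
planning_loss = weighted_policy_loss(
    log_probs, advantages, planning_weights(is_planning)
)
```

In this toy setup the planning-weighted loss is larger than the uniform one, because the poorly modeled planning step contributes more to the objective. A real system would have to estimate both the planning flags and the weights rather than take them as given.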
Findings
PAPO-VLA outperforms existing methods on multiple benchmarks.
Explicitly modeling planning actions improves task success rates.
The method effectively emphasizes critical actions during policy optimization.
Abstract
Vision-Language-Action (VLA) models show promising ability in language-guided robotic tasks. However, making VLA policies reliable remains challenging, because a manipulation task is completed through closed-loop interaction, where each action affects subsequent execution. To analyze this problem, we revisit the VLA policy during execution and argue that a VLA policy acts both as a planner, which makes task-oriented decisions that change the direction of execution, and as an executor, which realizes these decisions through dense continuous actions. This view suggests that improving VLA reliability requires particular attention to planning actions. Existing optimization methods can imitate actions or improve complete trajectories, but they usually do not explicitly identify planning actions or measure their importance for task success. To address this issue, we propose Planning-Aware Policy…
