TL;DR
PALM is a vision-language-action (VLA) framework for long-horizon robotic manipulation that reasons about affordances and tracks within-subtask progress, improving success rates and task-completion metrics on LIBERO-LONG, CALVIN ABC->D, and real-world tasks.
Contribution
PALM combines affordance reasoning with within-subtask progress prediction to stabilize and improve long-horizon policy execution in robotic manipulation.
Findings
Achieved a 91.8% success rate on the LIBERO-LONG benchmark.
Improved average task length by 12.5% on CALVIN ABC->D.
Doubled real-world performance over baseline methods.
Abstract
Recent advancements in vision-language-action (VLA) models have shown promise in robotic manipulation, yet they continue to struggle with long-horizon, multi-step tasks. Existing methods lack internal reasoning mechanisms that can identify task-relevant interaction cues or track progress within a subtask, leading to critical execution errors such as repeated actions, missed steps, and premature termination. To address these challenges, we introduce PALM, a VLA framework that structures policy learning around interaction-centric affordance reasoning and subtask progress cues. PALM distills complementary affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, and serve as task-relevant anchors for visuomotor control. To further stabilize long-horizon execution, PALM predicts continuous within-subtask progress, enabling seamless…
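To make the idea of a continuous within-subtask progress signal concrete, here is a minimal sketch of one way such a signal could gate movement through a list of subtasks. It is an illustration only, not PALM's mechanism: the `SubtaskTracker` class, the `threshold` value, the subtask names, and the progress values are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubtaskTracker:
    """Hypothetical helper: steps through subtasks using a progress estimate."""

    subtasks: List[str]
    threshold: float = 0.95  # assumed completion threshold, not from the paper
    index: int = 0

    @property
    def done(self) -> bool:
        # True once every subtask has been marked complete.
        return self.index >= len(self.subtasks)

    @property
    def current(self) -> Optional[str]:
        # The subtask currently being executed, or None when finished.
        return None if self.done else self.subtasks[self.index]

    def update(self, progress: float) -> Optional[str]:
        """Consume one progress estimate in [0, 1] for the current subtask."""
        if not 0.0 <= progress <= 1.0:
            raise ValueError("progress must lie in [0, 1]")
        # Advance only when the estimate says the current subtask is complete.
        if not self.done and progress >= self.threshold:
            self.index += 1
        return self.current


# Hypothetical subtask list and made-up progress estimates.
tracker = SubtaskTracker(["open drawer", "pick block", "place block in drawer"])
for estimate in [0.2, 0.6, 0.97, 0.1, 0.99]:
    tracker.update(estimate)
```

Gating on a threshold rather than a fixed step budget is one simple guard against the failure modes the abstract lists: a subtask is not repeated once its progress has crossed the threshold, and it is not abandoned before then.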
