TL;DR
ReCAPA introduces a hierarchical predictive correction framework for vision-language-action systems to reduce cascading failures and improve task execution accuracy in multimodal environments.
Contribution
The paper proposes a prediction- and contrast-based alignment architecture, along with new metrics for measuring error propagation and recovery in long-horizon tasks.
Findings
ReCAPA outperforms baseline models on VisualAgentBench, MineDojo, and AI2-THOR.
The framework effectively mitigates error propagation during multi-step task execution.
Experiments demonstrate improved alignment and task success rates.
Abstract
Vision-Language-Action (VLA) systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose ReCAPA, a Predictive Alignment and Planning Architecture that uses prediction and contrast to correct deviations at three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps so that they remain aligned with the overall intent. We…