HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation
Zijian Zeng, Fei Ding, Huiming Yang, Xianwei Li

TL;DR
HELM is a model-agnostic framework that improves long-horizon vision-language-action manipulation by closing memory, verification, and recovery gaps with dedicated modules, outperforming existing models on multiple benchmarks.
Contribution
The paper introduces HELM, a model-agnostic framework built from three components: an Episodic Memory Module (EMM), a learned State Verifier (SV), and a Harness Controller (HC) that performs rollback and replanning. Together they raise long-horizon manipulation success.
Findings
HELM improves success rate by 23.1 percentage points on LIBERO-LONG.
The learned verifier outperforms rule-based and ensemble baselines.
HELM enhances recovery success under perturbations.
Abstract
Vision-Language-Action (VLA) models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance. We show that this failure is not resolved by extending context length alone in the current reactive execution setting; instead, it stems from three recurring execution-loop deficiencies: the memory gap, the verification gap, and the recovery gap. We present HELM, a model-agnostic framework that addresses these deficiencies with three components: an Episodic Memory Module (EMM) that retrieves key task history via CLIP-indexed keyframes, a learned State Verifier (SV) that predicts action failure before execution from observation, action, subgoal, and memory-conditioned context, and a Harness Controller (HC) that performs rollback and replanning. The SV is the core learning contribution: it consistently outperforms rule-based feasibility checks and ensemble…
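The loop described in the abstract (retrieve history, verify the proposed action before executing it, roll back and replan on predicted failure) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `EpisodicMemory`, `run_subgoal`, the toy policy, verifier, and environment, the 0.5 failure threshold, and the retry limit are all hypothetical. Plain 2-D vectors with cosine similarity stand in for CLIP keyframe embeddings, and "rollback" is modeled as restoring the state saved at the start of the subgoal.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class EpisodicMemory:
    """Hypothetical stand-in for the EMM: keyframes indexed by an embedding."""
    keyframes: list = field(default_factory=list)  # (embedding, note) pairs

    def add(self, embedding, note):
        self.keyframes.append((embedding, note))

    def retrieve(self, query, k=2):
        # Return the notes of the k keyframes most similar to the query.
        ranked = sorted(self.keyframes, key=lambda kf: cosine(query, kf[0]), reverse=True)
        return [note for _, note in ranked[:k]]


def run_subgoal(policy, verifier, step_env, embed, state, subgoal, memory,
                threshold=0.5, max_retries=2, max_steps=20):
    """Verify each proposed action before executing it; on predicted failure,
    restore the subgoal-start checkpoint and replan (assumed controller logic)."""
    checkpoint, attempt, log = state, 0, []
    for _ in range(max_steps):
        if state == subgoal:
            return state, log, True
        context = memory.retrieve(embed(state))
        action = policy(state, subgoal, context, attempt)
        if verifier(state, action, subgoal, context) > threshold:
            log.append(f"rollback:{action}")
            attempt += 1
            if attempt > max_retries:
                return state, log, False
            state = checkpoint  # roll back, then let the policy replan
            continue
        state = step_env(state, action)
        memory.add(embed(state), f"{action}->{state}")
        log.append(f"exec:{action}")
    return state, log, state == subgoal


# Toy stand-ins (all hypothetical): integer states, goal is to count up to 3.
def toy_policy(state, subgoal, context, attempt):
    return "risky" if state == 1 and attempt == 0 else "safe"

def toy_verifier(state, action, subgoal, context):
    return 0.9 if action == "risky" else 0.1  # predicted failure probability

def toy_step(state, action):
    return state + 1

def toy_embed(state):
    return [float(state), 1.0]

def demo():
    return run_subgoal(toy_policy, toy_verifier, toy_step, toy_embed,
                       state=0, subgoal=3, memory=EpisodicMemory())
```

Calling `demo()` runs one subgoal in which the verifier rejects a single "risky" proposal, the state is restored to the checkpoint, and the replanned actions then reach the goal. In a real system the verifier would be a learned model and the embeddings would come from a CLIP image encoder; both are replaced by stubs here.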
