HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation

Zijian Zeng; Fei Ding; Huiming Yang; Xianwei Li

arXiv:2604.18791·cs.LG·April 22, 2026

TL;DR

HELM is a model-agnostic framework that improves long-horizon vision-language-action manipulation by closing three execution-loop gaps (memory, verification, and recovery) with dedicated modules. It raises success rate by 23.1 percentage points on LIBERO-LONG and outperforms existing models on multiple benchmarks.

Contribution

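The paper introduces HELM, a model-agnostic framework built from three components: an Episodic Memory Module that retrieves task history, a learned State Verifier that predicts action failure before execution, and a Harness Controller that performs rollback and replanning. Together they improve long-horizon manipulation success.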

Findings

01

HELM improves success rate by 23.1 percentage points on LIBERO-LONG.

02

The learned verifier outperforms rule-based and ensemble baselines.

03

HELM enhances recovery success under perturbations.

Abstract

Vision-Language-Action (VLA) models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance. We show that this failure is not resolved by extending context length alone in the current reactive execution setting; instead, it stems from three recurring execution-loop deficiencies: the memory gap, the verification gap, and the recovery gap. We present HELM, a model-agnostic framework that addresses these deficiencies with three components: an Episodic Memory Module (EMM) that retrieves key task history via CLIP-indexed keyframes, a learned State Verifier (SV) that predicts action failure before execution from observation, action, subgoal, and memory-conditioned context, and a Harness Controller (HC) that performs rollback and replanning. The SV is the core learning contribution: it consistently outperforms rule-based feasibility checks and ensemble…
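The execution loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `EpisodicMemory`, `run_with_harness`, `propose`, `predict_failure`, `apply`, and `embed` are hypothetical, plain Python lists stand in for CLIP keyframe embeddings, and a hand-written rule stands in for the learned State Verifier. It only shows how memory retrieval, a pre-execution failure check, and rollback with replanning might fit together.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class EpisodicMemory:
    """Hypothetical keyframe store; embeddings stand in for CLIP features."""
    keyframes: list = field(default_factory=list)  # list of (embedding, note)

    def add(self, embedding, note):
        self.keyframes.append((embedding, note))

    def retrieve(self, query, k=2):
        ranked = sorted(self.keyframes, key=lambda kf: cosine(kf[0], query), reverse=True)
        return [note for _, note in ranked[:k]]


def run_with_harness(subgoals, propose, predict_failure, apply, embed,
                     state, threshold=0.5, max_retries=2):
    """Assumed control flow: retrieve, propose, verify, then execute or roll back."""
    memory = EpisodicMemory()
    log = []
    for subgoal in subgoals:
        checkpoint = state  # rollback target for this subgoal
        for attempt in range(max_retries + 1):
            context = memory.retrieve(embed(state), k=2)
            action = propose(state, subgoal, context, attempt)
            if predict_failure(state, action, subgoal, context) >= threshold:
                # Verifier rejects the action before execution: restore the
                # checkpoint (a no-op in this toy) and ask for a new proposal.
                log.append(("replan", subgoal, action))
                state = checkpoint
                continue
            state = apply(state, action)
            memory.add(embed(state), f"done:{subgoal}")
            log.append(("execute", subgoal, action))
            break
        else:
            # Retries exhausted without an accepted action: stop the episode.
            log.append(("abort", subgoal, None))
            break
    return state, log


# Toy stand-ins for the policy, verifier, environment step, and encoder.
def propose(state, subgoal, context, attempt):
    return f"{subgoal}:{'fast' if attempt == 0 else 'careful'}"


def predict_failure(state, action, subgoal, context):
    return 0.9 if action == "place:fast" else 0.1  # toy rule, not a learned model


def apply(state, action):
    return state + (action,)


def embed(state):
    return [float(len(state)), 1.0]


final_state, log = run_with_harness(["pick", "place"], propose, predict_failure,
                                    apply, embed, state=())
print(final_state)
print(log)
```

In this toy run the verifier rejects the first `place` proposal, so the controller restores the checkpoint and accepts the second, more cautious proposal. The threshold of 0.5 and the retry budget are arbitrary choices for illustration.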
