TL;DR
The paper introduces LoHo-Manip, a modular framework that extends short-horizon vision-language-action (VLA) policies to long-horizon manipulation by pairing a task-management VLM with visual trace planning, with the aim of improving robustness and success on multi-step tasks.
Contribution
It decouples planning from execution: a task-management VLM replans in a receding-horizon manner, and a VLA executor conditioned on the rendered visual trace carries out each step as local control.
Findings
Significant improvements in long-horizon success rates.
Enhanced robustness and generalization in manipulation tasks.
Effective replanning without hand-crafted recovery logic.
Abstract
Long-horizon manipulation remains challenging for vision-language-action (VLA) policies: real tasks are multi-step, progress-dependent, and brittle to compounding execution errors. We present LoHo-Manip, a modular framework that scales short-horizon VLA execution to long-horizon instruction following via a dedicated task-management VLM. The manager is decoupled from the executor and is invoked in a receding-horizon manner: given the current observation, it predicts a progress-aware remaining plan that combines (i) a subtask sequence with an explicit done + remaining split as lightweight language memory, and (ii) a visual trace -- a compact 2D keypoint trajectory prompt specifying where to go and what to approach next. The executor VLA is adapted to condition on the rendered trace, thereby turning long-horizon decision-making into repeated local control by following the trace. Crucially,…
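The receding-horizon loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Plan` structure, the `manager`, `executor`, and `observe` callables, and the toy world in `demo` are all hypothetical names introduced here. The sketch only shows the control flow: the manager is re-invoked on every fresh observation and returns a done + remaining split plus a 2D trace, and the executor follows that trace for one local step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]  # hypothetical 2D keypoint in image coordinates


@dataclass
class Plan:
    """Hypothetical container for one manager prediction."""
    done: List[str]        # subtasks judged complete (language memory)
    remaining: List[str]   # subtasks still to do
    trace: List[Point]     # 2D keypoint trajectory toward the next target


def run_receding_horizon(
    manager: Callable[[Dict], Plan],
    executor: Callable[[Dict, List[Point]], None],
    observe: Callable[[], Dict],
    max_calls: int = 20,
) -> List[Plan]:
    """Re-invoke the manager on each new observation until nothing remains."""
    history: List[Plan] = []
    for _ in range(max_calls):
        obs = observe()
        plan = manager(obs)          # replan from the current observation
        history.append(plan)
        if not plan.remaining:       # task finished
            break
        executor(obs, plan.trace)    # short-horizon control along the trace
    return history


def demo() -> List[Plan]:
    """Toy world (assumption): progress is a counter, one execution slips."""
    subtasks = ["pick cup", "place cup on shelf", "close drawer"]
    targets = {
        "pick cup": (10.0, 20.0),
        "place cup on shelf": (40.0, 5.0),
        "close drawer": (25.0, 30.0),
    }
    world = {"completed": 0, "gripper": (0.0, 0.0), "attempts": 0}

    def observe() -> Dict:
        return dict(world)

    def manager(obs: Dict) -> Plan:
        k = obs["completed"]
        remaining = subtasks[k:]
        trace = [obs["gripper"], targets[remaining[0]]] if remaining else []
        return Plan(done=subtasks[:k], remaining=remaining, trace=trace)

    def executor(obs: Dict, trace: List[Point]) -> None:
        world["attempts"] += 1
        if world["attempts"] == 2:
            return                   # simulated execution failure: no progress
        world["gripper"] = trace[-1]
        world["completed"] += 1

    return run_receding_horizon(manager, executor, observe)
```

In this toy run the second execution makes no progress, so the next manager call returns the same remaining list and a fresh trace. Recovery falls out of replanning from the observation, with no separate recovery branch in the loop.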
