When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
Zilin Zhu, Longteng Guo, Yanghong Mei, Bowen Pang, Zongxun Zhang, Xingjian He, Ruyi Ji, and Jing Liu

TL;DR
LongAct is a benchmark for evaluating high-level planning in long-horizon household tasks, and HoloMind is a VLM-driven agent built to tackle it. Even top models reach only 59% goal completion, which points to a need for stronger planning capabilities.
Contribution
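The paper introduces LongAct, a benchmark for long-horizon household tasks specified through free-form instructions, and HoloMind, a hierarchical agent with spatial and episodic memory modules and a global Critic. Together they support both the evaluation and the development of planning in embodied AI.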
Findings
HoloMind improves long-horizon task performance significantly.
Top models achieve only 59% goal completion, indicating high task difficulty.
HoloMind reduces reliance on large model scale while enhancing performance.
Abstract
Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and…
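The abstract describes HoloMind's planner as DAG-based but does not give details here. As a rough illustration of the general idea only (not the authors' implementation), the sketch below orders subtasks so that every dependency is completed first, using Python's standard-library `graphlib`. The subtask names and the `plan_order` helper are hypothetical and chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks for an instruction such as "make tea, then tidy up".
# Each key maps to the set of subtasks that must finish before it can start.
# These names are illustrative assumptions, not taken from the paper.
dependencies = {
    "fill_kettle": set(),
    "boil_water": {"fill_kettle"},
    "find_mug": set(),
    "pour_tea": {"boil_water", "find_mug"},
    "wipe_counter": {"pour_tea"},
}


def plan_order(deps):
    """Return one valid execution order for a dependency DAG.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


order = plan_order(dependencies)
print(order)
```

Several orders can be valid for the same graph (for example, `find_mug` may come before or after `fill_kettle`), so a planner built this way is free to pick among independent subtasks based on other signals such as the agent's current location.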
