When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

Zilin Zhu; Longteng Guo; Yanghong Mei; Bowen Pang; Zongxun Zhang; Xingjian He; Ruyi Ji; and Jing Liu

arXiv:2605.14504·cs.AI·May 19, 2026

TL;DR

LongAct is a new benchmark for evaluating high-level planning in long-horizon household tasks, and HoloMind is a VLM-driven agent designed for those tasks. Results on the benchmark show that current models still need stronger planning capabilities.

Contribution

The paper introduces LongAct, a benchmark for long-horizon household tasks, and HoloMind, a hierarchical agent with memory modules, advancing the evaluation and development of planning in embodied AI.

Findings

01 HoloMind improves long-horizon task performance significantly.

02 Top models achieve only 59% goal completion, indicating high task difficulty.

03 HoloMind reduces reliance on large model scale while enhancing performance.

Abstract

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and…
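
To make the idea of a DAG-based planner and dependency management more concrete, the following is a minimal sketch and not HoloMind's actual implementation. It assumes a household instruction has already been decomposed into subtasks with known prerequisites. The subtask names, the `dependencies` mapping, and the `plan_in_waves` helper are all hypothetical and chosen only for illustration. The sketch uses Python's standard-library `graphlib.TopologicalSorter` to group subtasks into "waves" in which every subtask has all of its prerequisites completed.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks for an instruction such as "make tea and serve it".
# Each key maps to the set of subtasks that must finish before it can start.
dependencies = {
    "find_kettle": set(),
    "fill_kettle": {"find_kettle"},
    "boil_water": {"fill_kettle"},
    "find_cup": set(),
    "pour_tea": {"boil_water", "find_cup"},
    "serve_tea": {"pour_tea"},
}


def plan_in_waves(deps):
    """Group subtasks into waves whose prerequisites are all already done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock successors
    return waves


waves = plan_in_waves(dependencies)
for index, wave in enumerate(waves, start=1):
    print(index, wave)
```

In this toy graph, the two search subtasks have no prerequisites and land in the first wave, while `serve_tea` can only run last. A real planner would also need to revise the graph when a subtask fails or the environment changes, which this sketch does not attempt.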
