TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning
ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li, Keming Wu, Zhiying Du, Weijie Wang, Haoxiao Wang, Shuang Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, and Baining Guo

TL;DR
TaskGround is a framework that lets household agents infer executable task structure from a complete household scene and a situated request, improving task success rates and reducing input-token cost.
Contribution
The paper presents a training-free, model-agnostic method for grounding scenes and inferring task structure, which strengthens compact models on household reasoning.
Findings
Improves task success rates significantly.
Makes Qwen3.5-9B competitive with GPT-5 in scene understanding.
Reduces input-token cost by up to 18x.
Abstract
In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models…
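To make the problem statement concrete, here is a minimal Python sketch of what "executable task structure" could look like: a set of task-relevant entities, goal conditions, and ordering constraints that are resolved into a step sequence. This is an illustration only, not the TaskGround method. The names `TaskStructure`, `filter_scene`, and `order_steps`, the scene layout, and the example steps are all hypothetical, and the task structure is written by hand rather than inferred by a model.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class TaskStructure:
    """Hypothetical container for an inferred task (not the paper's schema)."""
    entities: list[str]
    goal_conditions: list[str]
    # Maps each step to the set of steps that must happen before it.
    ordering: dict[str, set[str]] = field(default_factory=dict)


def filter_scene(scene: dict[str, dict], relevant: set[str]) -> dict[str, dict]:
    """Keep only the scene entries for task-relevant entities."""
    return {name: attrs for name, attrs in scene.items() if name in relevant}


def order_steps(task: TaskStructure) -> list[str]:
    """Return one step order that respects every ordering constraint."""
    return list(TopologicalSorter(task.ordering).static_order())


# Toy scene: most entities are irrelevant to the request "put the dirty mug away".
scene = {
    "mug": {"room": "kitchen", "state": "dirty"},
    "sink": {"room": "kitchen"},
    "cabinet": {"room": "kitchen"},
    "sofa": {"room": "living_room"},
    "tv": {"room": "living_room"},
}

# Hand-written task structure standing in for whatever a model would infer.
task = TaskStructure(
    entities=["mug", "sink", "cabinet"],
    goal_conditions=["mug is clean", "mug is in cabinet"],
    ordering={
        "wash_mug": {"pick_up_mug"},
        "place_mug_in_cabinet": {"wash_mug"},
    },
)

grounded_scene = filter_scene(scene, set(task.entities))
plan = order_steps(task)
```

Dropping the task-irrelevant entries before prompting is the kind of step that would shrink the input, and the topological sort is one standard way to turn ordering constraints into a valid sequence. How TaskGround actually performs either step is not specified in the text shown here.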
