ManiTaskGen: A Comprehensive Task Generator for Benchmarking and Improving Vision-Language Agents on Embodied Decision-Making

Liu Dai; Haina Wang; Weikang Wan; Hao Su

arXiv:2505.20726 · cs.RO · July 30, 2025

TL;DR

ManiTaskGen is a system that automatically generates diverse, feasible mobile manipulation tasks for any given scene. The generated tasks serve both as benchmarks for evaluating embodied vision-language agents and as resources for improving them.

Contribution

The paper introduces a framework that automatically generates tasks for any given scene, supporting both broad benchmarking and improvement of agents on embodied decision-making.

Findings

01. Generated diverse tasks in simulated and real scenes.
02. Constructed new benchmarks for embodied decision-making.
03. Improved agent performance using ManiTaskGen tasks.

Abstract

Building embodied agents capable of accomplishing arbitrary tasks is a core objective towards achieving embodied artificial general intelligence (E-AGI). While recent work has advanced such general robot policies, their training and evaluation are often limited to tasks within specific scenes, involving restricted instructions and scenarios. Existing benchmarks also typically rely on manual annotation of limited tasks in a few scenes. We argue that exploring the full spectrum of feasible tasks within any given scene is crucial, as they provide both extensive benchmarks for evaluation and valuable resources for agent improvement. Towards this end, we introduce ManiTaskGen, a novel system that automatically generates comprehensive, diverse, feasible mobile manipulation tasks for any given scene. The generated tasks encompass both process-based, specific instructions (e.g., "move object…
