ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities

Ying Su; Zhan Ling; Haochen Shi; Jiayang Cheng; Yauwai Yim; Yangqiu Song

arXiv:2410.03907·cs.CL·October 8, 2024

TL;DR

This paper introduces ActPlan-1K, a comprehensive benchmark for evaluating the procedural planning and reasoning abilities of vision-language models in household activities, including normal and counterfactual scenarios.

Contribution

The paper builds a multi-modal, counterfactual planning benchmark using ChatGPT and the household activity simulator iGibson2, filling a gap in evaluating VLMs' reasoning on household tasks.

Findings

01. Current VLMs still fall short of human-level planning accuracy.

02. The benchmark includes 153 activities and 1,187 instances.

03. Automatic evaluation metrics are proposed for future research.

Abstract

Large language models (LLMs) have been adopted to process textual task descriptions and accomplish procedural planning in embodied AI tasks because of their powerful reasoning ability. However, there is still a lack of study on how vision language models (VLMs) behave when multi-modal task inputs are considered. Counterfactual planning, which evaluates a model's reasoning ability over alternative task situations, is also under-exploited. To evaluate planning ability in both the multi-modal and counterfactual aspects, we propose ActPlan-1K, a multi-modal planning benchmark constructed with ChatGPT and the household activity simulator iGibson2. The benchmark consists of 153 activities and 1,187 instances. Each instance, describing one activity, has a natural language task description and multiple environment images from the simulator. The gold plan of each instance is…
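The instance structure described in the abstract (one activity, a natural language task description, several environment images, and a gold plan) can be pictured with a minimal Python sketch. The class name `PlanningInstance`, its field names, the `counterfactual` flag, the representation of the gold plan as a list of step strings, and the helper `build_prompt` are all assumptions made for illustration; none of them is taken from the benchmark's released format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanningInstance:
    """One benchmark instance (hypothetical schema, not the released format)."""

    activity: str            # name of the household activity
    task_description: str    # natural language task description
    # Paths to the environment images captured from the simulator.
    image_paths: List[str] = field(default_factory=list)
    # Assumption: the gold plan is stored as an ordered list of step strings.
    gold_plan: List[str] = field(default_factory=list)
    # Assumption: a flag separating normal from counterfactual scenarios.
    counterfactual: bool = False


def build_prompt(instance: PlanningInstance) -> str:
    """Assemble the text part of a VLM query (hypothetical helper)."""
    kind = "counterfactual" if instance.counterfactual else "normal"
    return (
        f"Activity ({kind}): {instance.activity}\n"
        f"Task: {instance.task_description}\n"
        f"Attached environment images: {len(instance.image_paths)}\n"
        "Write a step-by-step plan to complete the task."
    )


# Made-up example values, used only to show how the pieces fit together.
example = PlanningInstance(
    activity="boil water",
    task_description="Fill the kettle and boil water in the kitchen.",
    image_paths=["kitchen_view_0.png", "kitchen_view_1.png"],
)
print(build_prompt(example))
```

Keeping the images as a list next to the text description mirrors the multi-modal input the abstract mentions; how images are actually passed to a given VLM depends on that model's interface and is left out here.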

Taxonomy

Topics: Persona Design and Applications · BIM and Construction Integration · Speech and dialogue systems