EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu, Yi Chen, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu

TL;DR
EgoPlan-Bench2 is a comprehensive benchmark for evaluating the planning abilities of multimodal large language models (MLLMs) in real-world scenarios. The paper uses it to highlight current limitations and proposes a training-free multimodal prompting approach to improve performance.
Contribution
The paper introduces EgoPlan-Bench2, a new benchmark for assessing MLLMs' planning skills in diverse real-world tasks, and demonstrates a prompt-based method to enhance GPT-4V without additional training.
Findings
MLLMs face significant challenges in real-world planning tasks.
A prompt-based approach improves GPT-4V's performance by over 10%.
EgoPlan-Bench2 covers 24 scenarios across 4 domains, closely mimicking daily life.
Abstract
The advent of Multimodal Large Language Models (MLLMs), leveraging the power of Large Language Models (LLMs), has recently demonstrated superior multimodal understanding and reasoning abilities, heralding a new era for artificial general intelligence (AGI). However, achieving AGI necessitates more than just comprehension and reasoning. A crucial capability required is effective planning in diverse scenarios, which involves making reasonable decisions based on complex environments to solve real-world problems. Despite its importance, the planning abilities of current MLLMs in varied scenarios remain underexplored. In this paper, we introduce EgoPlan-Bench2, a rigorous and comprehensive benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. EgoPlan-Bench2 encompasses everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
