EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

Lu Qiu; Yi Chen; Yuying Ge; Yixiao Ge; Ying Shan; Xihui Liu

arXiv:2412.04447·cs.AI·April 14, 2025

Open Access

TL;DR

EgoPlan-Bench2 is a comprehensive benchmark for evaluating the planning abilities of multimodal large language models in real-world scenarios. The paper highlights the limitations of current models and proposes a training-free multimodal prompting approach to improve performance.

Contribution

The paper introduces EgoPlan-Bench2, a new benchmark for assessing MLLMs' planning skills in diverse real-world tasks, and demonstrates a prompt-based method to enhance GPT-4V without additional training.

Findings

01. MLLMs face significant challenges in real-world planning tasks.

02. A prompt-based approach improves GPT-4V's performance by over 10%.

03. EgoPlan-Bench2 covers 24 scenarios across 4 domains, closely mimicking daily life.

Abstract

The advent of Multimodal Large Language Models (MLLMs), leveraging the power of Large Language Models, has recently demonstrated superior multimodal understanding and reasoning abilities, heralding a new era for artificial general intelligence (AGI). However, achieving AGI necessitates more than just comprehension and reasoning. A crucial capability required is effective planning in diverse scenarios, which involves making reasonable decisions based on complex environments to solve real-world problems. Despite its importance, the planning abilities of current MLLMs in varied scenarios remain underexplored. In this paper, we introduce EgoPlan-Bench2, a rigorous and comprehensive benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. EgoPlan-Bench2 encompasses everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems