REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks

Longling Geng; Edward Y. Chang

arXiv:2502.18836·cs.AI·August 6, 2025


TL;DR

REALM-Bench offers a comprehensive evaluation framework for multi-agent systems tackling real-world, dynamic planning and scheduling tasks, incorporating scalable complexity and diverse problem scenarios.

Contribution

It introduces a standardized benchmark suite with diverse problems, evaluation metrics, and baseline implementations to advance research in real-world multi-agent planning and scheduling.

Findings

01. Benchmark covers 14 complex planning problems

02. Includes 15 comparison methods and multiple LLM-based frameworks

03. Aims to standardize evaluation and foster progress in real-world AI planning

Abstract

This benchmark suite provides a comprehensive evaluation framework for assessing both individual LLMs and multi-agent systems in real-world planning and scheduling scenarios. The suite encompasses 14 planning and scheduling problems that progress from basic to highly complex, incorporating key aspects such as multi-agent coordination, inter-agent dependencies, and dynamic environmental disruptions. Each problem can be scaled along three dimensions: the number of parallel planning threads, the complexity of inter-dependencies, and the frequency of unexpected disruptions requiring real-time adaptation. The benchmark includes 14 detailed problem specifications; 15 comparison methods, including Random, LPT, SPT, STPT, MPSR, DRL-Liu, GP, GEP, LSO, SPT/TWKR, DRL-Chen, and DRL-Zhang; 2+ evaluation metrics; and baseline implementations using 3+ LLMs including GPT-4o, Claude-3.7,…
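Two of the comparison methods named in the abstract, SPT (shortest processing time first) and LPT (longest processing time first), are classic dispatching rules. The sketch below illustrates how such rules order jobs and how the resulting makespan can be compared. It assumes a simplified setting of identical parallel machines with no dependencies or disruptions, which is not necessarily how the benchmark configures them; the function name `list_schedule` and the example durations are hypothetical and do not come from the paper or its repositories.

```python
import heapq
from typing import List, Tuple


def list_schedule(durations: List[float], machines: int, rule: str) -> Tuple[float, List[List[int]]]:
    """Greedy list scheduling on identical machines (illustrative assumption).

    Jobs are ordered by the chosen dispatching rule, then each job is given
    to the machine that becomes free earliest. Returns the makespan and the
    job indices assigned to each machine.
    """
    if rule == "SPT":
        # Shortest processing time first
        order = sorted(range(len(durations)), key=lambda j: durations[j])
    elif rule == "LPT":
        # Longest processing time first
        order = sorted(range(len(durations)), key=lambda j: -durations[j])
    else:
        raise ValueError(f"unknown rule: {rule}")

    # Min-heap of (current load, machine id)
    heap = [(0.0, m) for m in range(machines)]
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(machines)]

    for j in order:
        load, m = heapq.heappop(heap)  # least-loaded machine
        assignment[m].append(j)
        heapq.heappush(heap, (load + durations[j], m))

    makespan = max(load for load, _ in heap)
    return makespan, assignment


if __name__ == "__main__":
    jobs = [2, 3, 4, 6, 2, 2]  # hypothetical processing times
    for rule in ("SPT", "LPT"):
        makespan, plan = list_schedule(jobs, machines=2, rule=rule)
        print(rule, makespan, plan)
```

On this toy instance LPT yields a shorter makespan than SPT, which is the usual reason LPT is preferred when the objective is makespan on parallel machines; SPT is typically favored for minimizing mean completion time instead.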
