A Critical Assessment of Benchmark Comparison in Planning
E. Dahlman, A. E. Howe

TL;DR
This paper critically examines the assumptions underlying empirical comparisons of planning algorithms, finds that most of them are unsupported, and argues for better benchmarks and methodology in planning research.
Contribution
The paper identifies and empirically tests eight implicit assumptions in planning benchmark comparisons, shows that most are unsupported, and calls for methodological improvements.
Findings
Most of the eight assumptions about problems, planners, and metrics are unsupported.
Different planners are affected differently by benchmarking assumptions.
The study advocates for better benchmark problems and evaluation practices.
Abstract
Recent trends in planning research have led to empirical comparison becoming commonplace. The field has started to settle into a methodology for such comparisons, which for obvious practical reasons requires running a subset of planners on a subset of problems. In this paper, we characterize the methodology and examine eight implicit assumptions about the problems, planners and metrics used in many of these comparisons. The problem assumptions are: PR1) the performance of a general-purpose planner should not be penalized/biased if executed on a sampling of problems and domains, PR2) minor syntactic differences in representation do not affect performance, and PR3) problems should be solvable by STRIPS-capable planners unless they require ADL. The planner assumptions are: PL1) the latest version of a planner is the best one to use, PL2) default parameter settings approximate good…
