Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents
Juhyun Oh, Eunsu Kim, Alice Oh

TL;DR
Flex-TravelPlanner is a new benchmark for evaluating language models' ability to adapt to dynamic, multi-turn planning scenarios with competing constraints. It reveals limitations in current models' flexibility and prioritization skills.
Contribution
This work presents Flex-TravelPlanner, a novel benchmark with dynamic, multi-turn planning scenarios and constraint prioritization, extending existing static planning evaluations for language models.
Findings
Models perform poorly on multi-turn adaptation tasks.
Order of constraint introduction significantly impacts performance.
Models often misprioritize constraints, favoring recent lower-priority ones.
Abstract
Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs' planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset (Xie et al., 2024), we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models' performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with…
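The two evaluation settings named in the abstract can be illustrated with a small sketch. This is not the benchmark's code: the `Constraint` class, the `build_turns` and `violated` helpers, the prompt wording, the plan dictionary, and the two example constraints are all hypothetical and exist only to show what "sequential constraint introduction" and "prioritized constraints" could look like in practice.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable


@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint record; not part of the benchmark's API."""
    name: str
    priority: int                   # assumption: lower number = higher priority
    check: Callable[[dict], bool]   # returns True when the plan satisfies it


def build_turns(base_query, constraints):
    """Turn 1 carries the base query; each later turn reveals one more constraint."""
    turns = [base_query]
    for c in constraints:
        turns.append(f"Please revise the plan so that it also satisfies: {c.name}")
    return turns


def violated(plan, constraints):
    """Names of unsatisfied constraints, highest priority first."""
    ordered = sorted(constraints, key=lambda c: c.priority)
    return [c.name for c in ordered if not c.check(plan)]


# Illustrative constraints and plan (made up for this sketch).
budget = Constraint("total cost <= 1000", 0, lambda p: p["total_cost"] <= 1000)
cuisine = Constraint("at least one Italian restaurant", 1,
                     lambda p: "Italian" in p["cuisines"])

# Every introduction order can be evaluated separately to probe order effects.
orders = list(permutations([budget, cuisine]))
turns = build_turns("Plan a 3-day trip.", orders[0])

plan = {"total_cost": 1200, "cuisines": ["Italian", "Thai"]}
print(violated(plan, [budget, cuisine]))  # ['total cost <= 1000']
```

In this toy case the plan meets the lower-priority cuisine request while breaking the higher-priority budget, which is the kind of misprioritization the findings describe.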
Taxonomy
Topics: AI-based Problem Solving and Planning · Multimodal Machine Learning Applications · Artificial Intelligence in Games
Methods: Focus · LLaMA
