Beyond Itinerary Planning: A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks
Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Lide Tan, Zheng Pan, Xin Li, Yong Liu

TL;DR
TravelBench is a comprehensive benchmark designed to evaluate large language models' abilities in real-world multi-turn travel planning tasks, including problem-solving, user interaction, and recognizing limitations.
Contribution
The paper introduces TravelBench, a realistic travel planning benchmark with real user data, tools, and subtasks to assess LLMs' core capabilities in practical scenarios.
Findings
Advanced models show imbalanced performance across capabilities.
TravelBench is stable and reproducible for evaluating travel planning agents.
The benchmark covers problem-solving, user interaction, and capability boundary recognition.
Abstract
Travel planning is a natural real-world task to test large language models' (LLMs) planning and tool-use abilities. Although prior work has studied LLM performance on travel planning, existing settings still differ from real-world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and a lack of evaluation of agents' capability boundaries. To mitigate these gaps, we propose TravelBench, a benchmark for travel planning. We collect user queries, user preferences, and tools from real scenarios, and construct three subtasks to evaluate agents' three core capabilities in real settings: (1) solving problems independently, (2) interacting with users to elicit implicit preferences, and (3) recognizing…
