Beyond Itinerary Planning: A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

Xiang Cheng; Yulan Hu; Xiangwen Zhang; Lu Xu; Lide Tan; Zheng Pan; Xin Li; Yong Liu

arXiv:2512.22673·cs.AI·April 22, 2026

TL;DR

TravelBench is a comprehensive benchmark designed to evaluate large language models' abilities in real-world multi-turn travel planning tasks, including problem-solving, user interaction, and recognizing limitations.

Contribution

The paper introduces TravelBench, a realistic travel planning benchmark with real user data, tools, and subtasks to assess LLMs' core capabilities in practical scenarios.

Findings

01. Advanced models show imbalanced performance across capabilities.

02. TravelBench is stable and reproducible for evaluating travel planning agents.

03. The benchmark covers problem-solving, user interaction, and capability boundary recognition.

Abstract

Travel planning is a natural real-world task to test large language models' (LLMs) planning and tool-use abilities. Although prior work has studied LLM performance on travel planning, existing settings still differ from real-world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and a lack of evaluation of agents' capability boundaries. To mitigate these gaps, we propose TravelBench, a benchmark for truly real-world travel planning. We collect user queries, user preferences, and tools from real scenarios, and construct three subtasks -- Single-Turn, Multi-Turn, and Unsolvable -- to evaluate agents' three core capabilities in real settings: (1) solving problems independently, (2) interacting with users to elicit implicit preferences, and (3) recognizing…
