ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
Tianlong Wang, Pinqiao Wang, Weili Shi, Sheng Li

TL;DR
ItinBench is a comprehensive benchmark designed to evaluate large language models across multiple cognitive domains, including verbal and spatial reasoning, revealing challenges in maintaining performance across diverse tasks.
Contribution
This work introduces ItinBench, a novel multi-dimensional reasoning benchmark integrating spatial and verbal tasks to assess LLMs' capabilities in complex, real-world scenarios.
Findings
LLMs struggle to perform consistently across multiple cognitive tasks
Incorporating diverse tasks reveals limitations in current LLM reasoning abilities
ItinBench provides new insights into multi-domain reasoning challenges
Abstract
Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for various reasoning and planning tasks. Traditional evaluations often focus on specific reasoning or planning questions within controlled environments. Recent studies have explored travel planning as a medium to integrate various verbal reasoning tasks into real-world contexts. However, reasoning tasks extend beyond verbal reasoning alone, and a comprehensive evaluation of LLMs requires a testbed that incorporates tasks from multiple cognitive domains. To address this gap, we introduce ItinBench, a benchmark that integrates one spatial reasoning task, route optimization, into trip itinerary planning while keeping the traditional verbal reasoning tasks. ItinBench evaluates various LLMs across diverse tasks simultaneously, including Llama 3.1 8B, Mistral Large, Gemini 1.5 Pro, and the GPT family.…
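To make the route-optimization idea concrete, the sketch below scores a proposed visiting order by how much longer it is than the shortest possible route through the same stops. This is only an illustration under stated assumptions: the function names, the planar coordinates, the fixed starting stop, and the brute-force search are all hypothetical, and the `distance_gap` measure is not ItinBench's actual metric definition.

```python
from itertools import permutations
from math import dist


def route_length(points, order):
    """Length of the open path that visits `points` in `order` (no return leg)."""
    return sum(dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def optimal_length(points):
    """Shortest open path starting at index 0, found by brute force.

    Only practical for a handful of stops; a real evaluation would use a
    proper TSP solver instead.
    """
    rest = range(1, len(points))
    return min(route_length(points, (0, *perm)) for perm in permutations(rest))


def distance_gap(points, order):
    """Hypothetical measure: extra distance of `order` over the optimum."""
    return route_length(points, order) - optimal_length(points)


# Hypothetical stops on a unit square (planar coordinates, not real data).
stops = [(0, 0), (0, 1), (1, 1), (1, 0)]
proposed = [0, 2, 1, 3]  # zig-zags across the diagonal twice

gap = distance_gap(stops, proposed)
```

A gap of zero means the proposed order is already as short as any order that starts from the same stop; larger values indicate a less efficient itinerary.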
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. This benchmark provides a well-designed task decomposition. The four tasks isolate where the difficulty comes from and let the reader see where LLMs fail. This kind of ablation-by-task may have a positive impact on the community. 2. This paper reveals a notable finding: a significant improvement in models' spatial reasoning occurs when the spatial structure is provided as textual clusters (Tasks 3–4), while simply “asking” the model to optimize the route (Task 2) does not reduce Total-DG.
1. This paper combines the work of Travel Planner and ITINERA, examining both constraint satisfaction capability and traditional TSP-like metrics. Given that Travel Planner and ITINERA already exist, the need for a new benchmark seems limited, and it offers few new insights. 2. The paper's title highlights "Multiple Cognitive Dimensions," yet the content primarily centers on spatial reasoning. This apparent narrowing of focus risks creating a perception of overclaiming relative to the b
1. Adds explicit spatial optimization metrics to travel planning evaluation. The paper introduces route optimization diagnostics (Total-DG, ECJ, ARG) based on TSP algorithms, extending existing benchmarks to quantify spatial efficiency. 2. Reveals heterogeneity in how different models approach spatial tasks. The finding that certain models (o1, GPT-4o) can utilize coordinates while others depend on textual cluster descriptions provides insight into varying spatial reasoning strategies, though s
1. Experimental confound in Task 3 prevents isolating the effect of cluster hints. Task 3 simultaneously introduces two changes: filtering candidates to a smaller subset and providing textual cluster descriptions, making it impossible to determine whether performance improvements arise from reduced search space, spatial cues, or their interaction. A missing control condition (filtered candidates without cluster hints) is needed to support the paper's claims about the value of spatial information
1. This paper broadens the research focus from general cognitive reasoning (verbal reasoning) to spatial reasoning, addressing a more challenging and realistic dimension of planning in real-world scenarios. 2. ItinBench introduces a well-structured four-task framework (verbal, mixed, spatial, and tool-use), allowing systematic analysis of reasoning trade-offs under increasing complexity. 3. The paper is clearly written and well-organized, making it easy for readers to follow the methodology and
1. The spatial reasoning component in ItinBench may overlap with existing route recommendation or mapping algorithms, which can already handle similar optimization tasks without relying on LLM-based reasoning. 2. While spatial reasoning is indeed crucial for LLMs—especially for embodied or physically grounded agents that must understand spatial relations (e.g., front–back, near–far), its current formulation in this benchmark is largely reduced to route optimization, a traditional planning proble
1. **Originality:** The work is groundbreaking in its incorporation of spatial reasoning as a first-class dimension within an LLM planning benchmark. The multi-dimensional evaluation framework, juxtaposing verbal and spatial reasoning, is conceptually novel and poses a forward-looking problem formulation. 2. **Quality:** The benchmark construction is highly rigorous. The data pipeline is clearly articulated and grounded in real-world data. The evaluation metric system is comprehensive and metho
1. **Breadth of Experimental Comparison:** While the internal task comparisons are thorough, the paper lacks performance comparisons with other state-of-the-art travel planning or spatial reasoning methods on the same dataset. Such comparisons would more clearly position ItinBench's challenge level relative to existing approaches. 2. **Depth of Investigation into "Pseudo-Spatial" Reasoning:** The paper astutely observes the phenomenon of LLMs relying on textual cues for spatial tasks. However,
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
