TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Yuanzhe Shen; Zisu Huang; Zhengyuan Wang; Muzhao Tian; Zhengkang Guo; Chenyang Zhang; Shuaiyu Zhou; Zengjie Hu; Dailin Li; Jingwen Xu; Kaimin Wang; Wenhao Liu; Tianlong Li; Fengpeng Yue; Feng Hong; Cao Liu; Ke Zeng

arXiv:2602.01675 · cs.AI · February 3, 2026

TL;DR

TRIP-Bench is a travel-planning benchmark grounded in real-world data for evaluating long-horizon interactive agents. It exposes the limitations of current models, and the paper also proposes GTPO, an online reinforcement learning method that improves constraint satisfaction and robustness.

Contribution

The paper introduces TRIP-Bench, a realistic long-horizon benchmark for interactive agents, and proposes GTPO, an online RL method that improves constraint satisfaction and robustness.

Findings

01. Even advanced models achieve at most 50% success on the easy split.

02. Performance drops below 10% on hard subsets.

03. GTPO outperforms Gemini-3-Pro in evaluations.

Abstract

As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce TRIP-Bench, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with…

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Topic Modeling