OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Fangzhi Xu; Hang Yan; Qiushi Sun; Jinyang Wu; Zixian Huang; Muye Huang; Jingyang Gong; Zichen Ding; Kanzhi Cheng; Yian Wang; Xinyu Che; Zeyi Sun; Jian Zhang; Zhangyue Yin; Haoran Luo; Xuanjing Huang; Ben Kao; Jun Liu; Qika Lin

arXiv:2602.05843·cs.CL·February 6, 2026

TL;DR

OdysseyArena introduces a new benchmarking framework for evaluating large language models on long-horizon, active, and inductive interactions, highlighting current models' limitations in autonomous discovery within complex environments.

Contribution

The paper presents OdysseyArena, a comprehensive benchmark with environments and tasks designed to assess LLMs' inductive learning and long-term strategic capabilities.

Findings

01. Leading LLMs show limited inductive reasoning in complex tasks.

02. OdysseyArena-Lite provides 120 standardized tasks for benchmarking.

03. OdysseyArena-Challenge stresses models on extreme interaction horizons.

Abstract

The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking,…

Code & Models

Datasets

xufangzhi/OdysseyArena
dataset · 10 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Healthcare and Education