LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents

Amin Rakhsha; Thomas Hehn; Pietro Mazzaglia; Fabio Valerio Massoli; Arash Behboodi; Tribhuvanesh Orekondy

arXiv:2601.16649·cs.AI·January 26, 2026

TL;DR

This paper introduces LUMINA, a framework for evaluating the importance of specific skills like planning and state tracking in multi-turn AI agents using oracle interventions in procedurally generated tasks.

Contribution

The paper develops an oracle counterfactual framework and a suite of controlled, game-like tasks to measure the impact of different skills on multi-turn agent performance.

Findings

01. Planning consistently improves performance across environments.

02. The usefulness of skills varies depending on environment properties.

03. Procedurally generated tasks enable precise measurement of skill contributions.

Abstract

Large language models can perform well on many isolated tasks, yet they continue to struggle on multi-turn, long-horizon agentic problems that require skills such as planning, state tracking, and long context processing. In this work, we aim to better understand the relative importance of advancing these underlying capabilities for success on such tasks. We develop an oracle counterfactual framework for multi-turn problems that asks: how would an agent perform if it could leverage an oracle to perfectly perform a specific task? The change in the agent's performance due to this oracle assistance allows us to measure the criticality of such oracle skill in the future advancement of AI agents. We introduce a suite of procedurally generated, game-like tasks with tunable complexity. These controlled environments allow us to provide precise oracle interventions, such as perfect planning or…
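As a rough illustration of the counterfactual measurement described in the abstract, the sketch below computes the change in success rate when an oracle is switched on. Everything in it is hypothetical: the `Episode` callable, the oracle label `"plan"`, and the toy episodes are assumptions for illustration, not the paper's environments or code.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface: an episode runs one task with an optional oracle
# (identified here by a plain string label) and reports success or failure.
Episode = Callable[[Optional[str]], bool]


def success_rate(episodes: Sequence[Episode], oracle: Optional[str] = None) -> float:
    """Fraction of episodes solved when the agent is given `oracle` (or none)."""
    outcomes = [episode(oracle) for episode in episodes]
    return sum(outcomes) / len(outcomes)


def oracle_gain(episodes: Sequence[Episode], oracle: str) -> float:
    """Success rate with the oracle minus the success rate without it."""
    return success_rate(episodes, oracle) - success_rate(episodes, None)


# Toy stand-ins for real rollouts (assumed, purely illustrative).
toy_episodes = [
    lambda oracle: oracle == "plan",    # solved only with the planning oracle
    lambda oracle: True,                # solved either way
    lambda oracle: oracle is not None,  # solved with any oracle
    lambda oracle: False,               # never solved
]

gain = oracle_gain(toy_episodes, "plan")
```

Evaluating both conditions on the same set of episodes keeps the comparison paired, so the difference reflects the oracle rather than variation between tasks.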

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

Originality: The paper designs three procedurally generated multi-turn environments, which facilitate the study of which skills have the greatest impact on agent capability.
Significance: Analyzing which skills, or combinations of skills, constitute the main bottlenecks to advancing capable multi-turn agents is highly meaningful, as it provides guidance for targeted improvements.
Clarity: The paper is clearly written.

Weaknesses

Quality: The capability improvements observed in simulation environments may not necessarily transfer to real-world settings.
Significance: Can the conclusions drawn from simulation environments be applied to benchmarks in real-world scenarios?

Reviewer 02 · Rating 4 · Confidence 4

Strengths

Three procedurally generated environments enable controllable complexity. They are designed with simple action spaces and trajectory-level annotations, supporting accurate measurement of optimal actions.

Weaknesses

The idea of using oracle-based counterfactual interventions to dissect agent capabilities is interesting. However, I have some concerns. The three oracle modules are treated as independent switches, but they interact tightly. Oplan converts the decision into a one-step optimal subtask, inherently reducing the need for state inference or history recall. The Ostate summarization may already encode most of the historical trajectory. Also, the simplification of Ohistory as truncating earlier steps is ques

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper identifies a crucial and underexplored limitation in current LLM-based agents—their inability to maintain robust long-horizon reasoning across multiple turns. The motivation is well-grounded in empirical evidence (e.g., low success rates despite high per-step accuracy), and the authors effectively position long-horizon understanding as a distinct capability beyond standard reasoning or planning. The introduction of an oracle counterfactual intervention framework is a major methodologi

Weaknesses

While the proposed environments (ListWorld, TreeWorld, GridWorld) are carefully controlled and effective for isolating individual skills, they remain relatively synthetic and detached from widely adopted agentic benchmarks such as ScienceWorld, OSWorld, or TravelPlanner. As a result, the paper provides valuable mechanistic insight but lacks direct evidence that the identified skill bottlenecks generalize to real-world multi-turn tasks. This limitation weakens the practical applicability and exte

Reviewer 04 · Rating 2 · Confidence 3

Strengths

This paper provides a clear analysis of compounding errors in long-horizon tasks, showing how small step-wise mistakes accumulate to reduce overall success. It further disentangles specific agentic skills and introduces controlled environments to assess their individual contributions. Experiments across multiple skill combinations and model scales reveal that larger models can leverage longer contextual dependencies more effectively.
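The compounding effect mentioned here can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that every step succeeds independently with the same probability and that a single mistake fails the episode; this is a simplifying assumption for illustration, not necessarily the model used in the paper, and the function name is hypothetical.

```python
def episode_success_probability(step_accuracy: float, horizon: int) -> float:
    """Probability that all `horizon` steps succeed.

    Assumes independent steps with identical accuracy and no recovery
    from errors (an illustrative simplification).
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must lie in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return step_accuracy ** horizon


# Under these assumptions, 99% per-step accuracy over 100 steps
# gives roughly a 37% chance of completing the whole episode.
p_long = episode_success_probability(0.99, 100)
```

Even a small per-step error rate therefore dominates the outcome once the horizon is long enough.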

Weaknesses

1. The proposed environments are symbolic and fully rule-defined, omitting key challenges of real-world tasks such as parsing unstructured feedback. It is therefore unclear whether the identified bottlenecks generalize to real-world tasks.
2. Some of the findings have been reported in other works, which may limit the novelty of the results. For instance, recent studies on memory-augmented and planning-based agents have shown that these components can substantially influence performance.
3. A

Reviewer 05 · Rating 4 · Confidence 3

Strengths

1. The construction of three worlds that enable oracle skill control contributes to the research community. The worlds allow faithfully constructed oracle skills to be used to study the behavior and performance of LLM agents, which helps in understanding what affects LLM performance.
2. The finding that LLMs excel at each step but perform relatively poorly over the entire horizon is interesting.

Weaknesses

1. I would like to see the performance of stronger models, such as Qwen3-235B, which you mention in the abstract, as well as GPT-4o and perhaps GPT-5. I would also like to see how the o3 or o4-mini models perform. I am concerned that these tasks may only be hard enough for small open-source models (Qwen3-4b can reach 86% with state tracking and planning in grid world). If this is the case, those worlds are still useful but limited.
2. The oracle formulation accommodates hints, planning, state tracking and history pr

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · AI-based Problem Solving and Planning