Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning

Bosung Kim; Prithviraj Ammanabrolu

arXiv:2505.16928·cs.AI·February 20, 2026

Open Access 3 Reviews

TL;DR

This paper introduces $\infty$-THOR, a comprehensive framework for long-horizon embodied AI tasks, including a new benchmark, dataset, and architectural strategies to enhance long-context reasoning and planning capabilities.

Contribution

It presents a novel long-horizon dataset and benchmark, along with architectural adaptations for LLMs, to advance long-term reasoning in embodied AI.

Findings

01

Challenges in long-horizon reasoning are highlighted.

02

Architectural techniques improve long-context understanding.

03

Benchmark provides a new standard for extended embodied tasks.

Abstract

We introduce $\infty$-THOR, a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. $\infty$-THOR provides: (1) a generation framework for synthesizing scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack, where multiple scattered clues across extended trajectories test agents' long-context reasoning ability; and (3) a long-horizon dataset and benchmark suite featuring complex tasks that span hundreds of environment steps, each paired with ground-truth action sequences. To enable this capability, we explore architectural adaptations, including interleaved Goal-State-Action modeling, context extension techniques, and Context Parallelism, to equip LLM-based agents for extreme long-context reasoning and interaction. Experimental results and analyses highlight the…
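The abstract names interleaved Goal-State-Action modeling without describing its layout. As a rough illustration only, the sketch below assumes the goal comes first and is followed by alternating state and action entries in a single flat sequence. The `Step` class, the `interleave` function, the role labels, and the example goal and action strings are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    state: str   # placeholder for an observation (hypothetical; e.g. an encoded frame)
    action: str  # action taken after seeing that state (illustrative name)


def interleave(goal: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Flatten a trajectory into one (role, content) sequence.

    Assumed layout: goal first, then state/action pairs in time order.
    """
    sequence = [("goal", goal)]
    for step in steps:
        sequence.append(("state", step.state))
        sequence.append(("action", step.action))
    return sequence


# Illustrative two-step trajectory; the strings are made up for this sketch.
trajectory = [Step("obs_0", "MoveAhead"), Step("obs_1", "PickupObject")]
sequence = interleave("put the mug in the sink", trajectory)
roles = [role for role, _ in sequence]
```

Under this assumed layout the sequence grows by two entries per environment step, so trajectories spanning hundreds of steps quickly produce very long contexts.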

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

Strong problem focus: long-horizon memory in embodied AI, timely and hard. $\infty$-THOR is scalable and open: it can synthesize endless, ultra-long trajectories with full action traces. The NiEH benchmark is unique: it tests recall of scattered clues across hundreds of steps, bridging vision, language, and long-term reasoning. Task design is clever: it enforces early–late dependencies, with no shortcuts. The interleaved Goal–State–Action model is a clean, unified architecture that handles temporal context elegantly. Rigorous experimen

Weaknesses

Relies only on context extension for memory. No exploration of retrieval or hierarchical memory; limits scalability beyond 512k tokens. No direct baseline against modular vision–language–action models. Claim of superiority for interleaved modeling not empirically proven. Lacks external dynamics: all environment changes are agent-driven. No tests of memory for unobserved or changing scenes. Models fail beyond 0.5M tokens, multi-evidence QA accuracy drops sharply, and long runs often collapse. N

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The motivation for this work is very solid and timely, as current models for embodied planning and decision making struggle with long context and are often confined to short-term tasks without the ability to perform long-horizon optimization. Although the task of recalling details in long-horizon action sequences does not directly target the core of the planning problem, it points to a capability in the right direction. 2. The interleaved Goal–State–Action modeling idea is interesting.

Weaknesses

1. The model claims to explore "ARCHITECTURES FOR LONG-HORIZON VISION-LANGUAGE-ACTION MODELS". However, there is no explicit evaluation of the core task for VLAs: planning. Instead, the authors only evaluate the model on the Needles in the Embodied Haystack task of long-horizon question answering. 2. The performance gains on this single task do not fully justify the complex architecture changes made. Perhaps one way to more concretely justify the modeling is by experimenting on other tasks more

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The long-context, multi-goal trajectory generation method and the question-answer pair generation method are novel and could contribute to the VLA/VLM community. - The empirical results reflect the poor performance of current methods in long-context settings.

Weaknesses

`W1`: Overall, the paper is not easy to follow, and the presentation lacks clarity in explaining what was actually done. Significant effort is required to understand precisely the main contributions and methodology. The writing would benefit from a more direct, transparent exposition of the key ideas and experimental details. `W2`: Ultimately, the results in Figure 4 suggest that current architectures (incl. long context solutions) struggle with contexts longer than the training / FT context

Taxonomy

Topics: Design Education and Practice