LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks

Xueyao Chen, Jingkai Jia, Tong Yang, Yibo Fu, Wei Li, and Wenqiang Zhang

arXiv:2604.16788·cs.RO·April 21, 2026

TL;DR

LongBench is a real-world benchmark of over 1,000 episodes for evaluating robotic manipulation policies on long-horizon tasks, with a focus on execution robustness and context-dependent reasoning.

Contribution

The paper introduces LongBench, a new real-world benchmark for long-horizon manipulation tasks, enabling detailed analysis of robustness and contextual reasoning in robotic policies.

Findings

01. Performance in fully observable settings correlates with execution robustness.

02. Context-dependent difficulty varies and is not always mitigated by memory-based methods.

03. Long-horizon performance is influenced by multiple factors, not a single one.

Abstract

Robotic manipulation policies often degrade over extended horizons, yet existing benchmarks provide limited insight into why such failures occur. Most prior benchmarks are either simulation-based or report aggregate success, making it difficult to disentangle the distinct sources of temporal difficulty in real-world execution. We introduce LongBench, a real-world benchmark for evaluating long-horizon manipulation. LongBench consists of over 1,000 real-world episodes, covering two complementary regimes: Context-Independent (fully observable) and Context-Dependent (ambiguity-driven). By organizing tasks into capability- and ambiguity-specific subsets, LongBench enables mechanism-aware evaluation of execution robustness, temporal consistency, and context-dependent reasoning. Evaluating six state-of-the-art policies reveals that long-horizon performance is not governed by a single factor.…
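The contrast the abstract draws between aggregate success and subset-level evaluation can be illustrated with a minimal sketch. Everything below is hypothetical: the episode records, the subset names (`subset_a` to `subset_d`), and the helper functions are invented for illustration and are not LongBench data or tooling. The sketch only shows how a single aggregate number can hide large differences between subsets.

```python
from collections import defaultdict

# Hypothetical episode records: (regime, subset, success).
# These values are made up for illustration; they are NOT benchmark results.
episodes = [
    ("context_independent", "subset_a", True),
    ("context_independent", "subset_a", True),
    ("context_independent", "subset_b", True),
    ("context_independent", "subset_b", False),
    ("context_dependent", "subset_c", False),
    ("context_dependent", "subset_c", False),
    ("context_dependent", "subset_d", True),
    ("context_dependent", "subset_d", False),
]


def aggregate_rate(records):
    """Single success rate over all episodes, ignoring regime and subset."""
    return sum(int(success) for _, _, success in records) / len(records)


def success_rates(records):
    """Success rate for each (regime, subset) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for regime, subset, success in records:
        counts[(regime, subset)][0] += int(success)
        counts[(regime, subset)][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


overall = aggregate_rate(episodes)
per_subset = success_rates(episodes)

print(f"aggregate: {overall:.2f}")
for (regime, subset), rate in sorted(per_subset.items()):
    print(f"{regime}/{subset}: {rate:.2f}")
```

With this made-up data the aggregate rate is 0.50, while the per-subset rates range from 0.00 to 1.00, which is the kind of spread an aggregate score alone cannot reveal.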
