LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks
Xueyao Chen, Jingkai Jia, Tong Yang, Yibo Fu, Wei Li, and Wenqiang Zhang

TL;DR
LongBench is a real-world benchmark of over 1,000 episodes for evaluating robotic manipulation policies over long horizons, with a focus on execution robustness and context-dependent reasoning.
Contribution
The paper introduces LongBench, a new real-world benchmark for long-horizon manipulation tasks, enabling detailed analysis of robustness and contextual reasoning in robotic policies.
Findings
In the Context-Independent (fully observable) regime, performance correlates with execution robustness.
In the Context-Dependent regime, difficulty varies, and memory-based methods do not always mitigate it.
Long-horizon performance is not governed by a single factor.
Abstract
Robotic manipulation policies often degrade over extended horizons, yet existing benchmarks provide limited insight into why such failures occur. Most prior benchmarks are either simulation-based or report aggregate success, making it difficult to disentangle the distinct sources of temporal difficulty in real-world execution. We introduce LongBench, a real-world benchmark for evaluating long-horizon manipulation. LongBench consists of over 1,000 real-world episodes, covering two complementary regimes: Context-Independent (fully observable) and Context-Dependent (ambiguity-driven). By organizing tasks into capability- and ambiguity-specific subsets, LongBench enables mechanism-aware evaluation of execution robustness, temporal consistency, and context-dependent reasoning. Evaluating six state-of-the-art policies reveals that long-horizon performance is not governed by a single factor.…
