OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

Weixuan Wang; Dongge Han; Daniel Madrigal Diaz; Jin Xu; Victor Rühle; Saravan Rajmohan

arXiv:2508.09124·cs.CL·August 13, 2025

TL;DR

OdysseyBench is a new comprehensive benchmark designed to evaluate large language model agents on complex, long-horizon office workflows, addressing limitations of existing atomic task benchmarks.

Contribution

The paper introduces OdysseyBench, a novel benchmark with tasks inspired by real-world use, and HomerAgents, a multi-agent framework for scalable benchmark creation.

Findings

01

OdysseyBench effectively challenges state-of-the-art LLM agents.

02

It provides a more accurate assessment of LLM capabilities in complex workflows.

03

The benchmark covers diverse office applications.

Abstract

Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires the agent to identify essential information from long-horizon interaction histories and perform multi-step…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The benchmark construction approach is comprehensive: it not only derives tasks from existing task intents (via HOMERAGENTS+ transforming atomic tasks from OfficeBench into context-rich scenarios) but also generates long-horizon tasks entirely from scratch (via HOMERAGENTS-NEO). Notably, both subsets (OdysseyBench+ and OdysseyBench-Neo) are accompanied by multi-day dialogue histories, which better simulate real-world user-agent interactions.

2. Testing human performance is well done. Human an

Weaknesses

1. Ablation studies for the OdysseyBench-Neo construction process are lacking, and key concepts (e.g., "surfers") lack concrete case illustrations. The paper proposes HOMERAGENTS-NEO to generate long-horizon tasks from scratch via four phases, but it does not validate the necessity of each component or compare its performance against simpler generation methods. Additionally, the role of "surfers" in collecting contextual information from office applications is described in abstract terms, and i

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. Addresses a Good question: The paper identifies a good limitation of current agent benchmarks: their focus on "atomic" tasks. It rightly argues for the need to evaluate agents on long-horizon workflows that require reasoning over accumulated

2. Good Evaluation: The paper provides an evaluation of numerous models. The inclusion of both long-context and various RAG strategies (raw vs. summary, session-level vs. chunk-level) provides a valuable analysis of how agents cope with long context.

4.

Weaknesses

1. Overclaiming and Imprecise Language: The paper's repeated claim that `OdysseyBench+` is derived from "real-world use cases" is an overstatement. It is a synthetic transformation of another benchmark (OfficeBench). This linguistic imprecision ("real-world" vs. "realistic simulation") undermines the paper's claims about the benchmark's grounding.

2. Misplaced Focus on Method over Benchmark: This is a benchmark paper, but the majority of the methodology (Section 3) is devoted to the HomerAgent

Reviewer 03 · Rating 4 · Confidence 3

Strengths

* The benchmarks contain both manually curated and synthesized queries, which is good for the evaluation of LLMs nowadays.

* The paper is generally well-written and easy to follow.

Weaknesses

* The benchmark is built upon and inspired by the existing OfficeBench, but there is no detailed analysis of: 1) the differences between the previous work and the current study; 2) the limitations of the previous benchmark and how they are handled in the current study; 3) the environment (infrastructure) scaling issue for long-horizon, complex office task evaluation.

* There is no discussion of the performance/accuracy of the generator and verifier themselves for the synthesis/extension of office tasks, simil

Reviewer 04 · Rating 2 · Confidence 3

Strengths

1. This paper proposes a new benchmark, OdysseyBench, to evaluate the comprehensive capabilities of LLM agents in long-term office tasks, and covers a variety of real-world office applications: Word, Excel, PDF, Email, and Calendar.

2. The authors also propose HOMERAGENTS, a multi-agent automatic generation framework for scalably building long-term task benchmarks. Through environment exploration, the task generation and dialogue synthesis stages are automated.

3. Long-Horizon Complex tasks ass

Weaknesses

1. The HOMERAGENTS framework is not particularly novel; similar approaches that combine information collection and data generation have already been employed in several prior works, such as AppAgent and others.

2. Long-horizon complex tasks are indeed underexplored in previous datasets and benchmarks. However, as a benchmark, it should be more comprehensive; the current limited amount of data is insufficient to support a thorough evaluation of agents’ capabilities in complex task scenarios.

3.

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Multimodal Machine Learning Applications