Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

TL;DR
Odysseys introduces a benchmark of 200 real-world, long-horizon web tasks to evaluate web agents' sustained multi-site reasoning and efficiency, addressing limitations of existing short-task benchmarks.
Contribution
The paper presents Odysseys, a new benchmark with a rubric-based evaluation for long-horizon web tasks, and assesses current models' performance and efficiency on these tasks.
Findings
The strongest models achieve a success rate of only 44.5%.
Rubric-based evaluation aligns better with human judgment than binary pass/fail evaluation.
Agents exhibit only 1.15% efficiency, indicating room for improvement.
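To make the contrast between binary and rubric-based scoring concrete, here is a minimal sketch. It assumes each rubric item is simply satisfied or not and that all items are weighted equally; the paper's actual grading scheme may differ. The names `RubricItem`, `rubric_score`, and `binary_success`, as well as the example task, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One graded criterion for a task (hypothetical structure)."""
    description: str
    satisfied: bool


def rubric_score(items: list[RubricItem]) -> float:
    """Fraction of rubric items satisfied, assuming equal weights."""
    if not items:
        raise ValueError("a task needs at least one rubric item")
    return sum(item.satisfied for item in items) / len(items)


def binary_success(items: list[RubricItem]) -> bool:
    """Binary pass/fail: the task passes only if every item is satisfied."""
    return all(item.satisfied for item in items)


# Hypothetical trip-planning task with four rubric items.
example_items = [
    RubricItem("Found a flight within budget", True),
    RubricItem("Booked a hotel near the venue", True),
    RubricItem("Compared prices on at least two sites", False),
    RubricItem("Summarized the itinerary", False),
]

print(rubric_score(example_items))    # partial credit
print(binary_success(example_items))  # all-or-nothing outcome
```

Under these assumptions, an agent that completes half of a long workflow receives partial credit from the rubric score, while the binary criterion records the same trajectory as a plain failure.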
Abstract
Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real-world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with…
