APEX-Agents

Bertie Vidgen; Austin Mann; Abby Fennelly; John Wright Stanly; Lucas Rothman; Marco Burstein; Julien Benchek; David Ostrofsky; Anirudh Ravichandran; Debnil Sur; Neel Venugopal; Alannah Hsia; Isaac Robinson; Calix Huang; Olivia Varones; Daniyal Khan; Michael Haines; Austin Bridges; Jesse Boyle; Koby Twist; Zach Richards; Chirag Mahapatra; Brendan Foody; Osvald Nitski

arXiv:2601.14242 · cs.CL · February 24, 2026

TL;DR

APEX-Agents is a benchmark that evaluates whether AI agents can carry out complex, long-horizon tasks spanning multiple applications in realistic work environments. The benchmark and its execution infrastructure are open source, and results are reported on a leaderboard.

Contribution

This paper introduces the APEX-Agents benchmark (n=480) and Archipelago, the accompanying infrastructure for agent execution and evaluation. Both are open source, enabling standardized assessment of AI agents on realistic, cross-application tasks.

Findings

01 Gemini 3 Flash (Thinking=High) achieves the highest Pass@1 score, 24.0%.

02 Eight agents are evaluated for the leaderboard.

03 The benchmark and the execution infrastructure are open source.

Abstract

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open source Archipelago, our infrastructure for agent execution and evaluation.
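The abstract reports results as Pass@1, that is, the share of tasks an agent solves in a single attempt. The sketch below illustrates that calculation on made-up data. It is not the authors' scoring code: the rule that a run passes only when every rubric criterion is met is an assumption made for illustration, and the function names and the example task IDs are hypothetical.

```python
from typing import Dict, List


def task_passed(criteria_met: List[bool]) -> bool:
    """Return True if a single run counts as a pass.

    Assumption (not stated in the abstract): a run passes only when
    every rubric criterion is met. An empty rubric never passes.
    """
    return len(criteria_met) > 0 and all(criteria_met)


def pass_at_1(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose single run passed.

    `runs` maps a task ID to the per-criterion results of one run.
    """
    if not runs:
        raise ValueError("no tasks to score")
    passed = sum(task_passed(criteria) for criteria in runs.values())
    return passed / len(runs)


# Hypothetical results for four tasks, one run each.
example_runs = {
    "task_a": [True, True, True],  # all criteria met -> pass
    "task_b": [True, False],       # one criterion missed -> fail
    "task_c": [False],             # fail
    "task_d": [True],              # pass
}

score = pass_at_1(example_runs)
print(f"Pass@1 = {score:.1%}")
```

With one run per task, Pass@1 is simply the number of passed tasks divided by the number of tasks; in this toy example two of four tasks pass.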

Code & Models

Models

nitinsaini08/harfeast-env (Hugging Face model)

Taxonomy

Topics: Artificial Intelligence in Law · Explainable Artificial Intelligence (XAI) · Multi-Agent Systems and Negotiation