Forecasting Frontier Language Model Agent Capabilities

Govind Pimpale; Axel Højmark; Jérémy Scheurer; Marius Hobbhahn

arXiv:2502.15850·cs.CL·March 4, 2025

TL;DR

This paper evaluates six methods for forecasting the capabilities of language model agents, then applies the validated two-step approach (Release Date → Elo → Benchmark) to frontier models to estimate their future benchmark success rates.

Contribution

It introduces and validates a two-step forecasting approach for LM agent capabilities, providing predictions for future benchmark success rates of frontier models.

Findings

01

The forecast predicts a 54% success rate on SWE-Bench Verified by the beginning of 2026 for non-specialized LM agents with low capability elicitation.

02

The forecast predicts an 87% success rate on SWE-Bench Verified by the beginning of 2026 for state-of-the-art LM agents.

03

The two-step approach is validated through backtesting on 38 LMs.

Abstract

As Language Models (LMs) increasingly operate as autonomous agents, accurately forecasting their capabilities becomes crucial for societal preparedness. We evaluate six forecasting methods that predict downstream capabilities of LM agents. We use "one-step" approaches that predict benchmark scores directly from input metrics like compute or model release date, or "two-step" approaches that first predict an intermediate metric, such as the principal component of cross-benchmark performance (PC-1) or human-evaluated competitive Elo ratings. We evaluate our forecasting methods by backtesting them on a dataset of 38 LMs from the OpenLLM 2 leaderboard. We then use the validated two-step approach (Release Date → Elo → Benchmark) to predict LM agent performance for frontier models on three benchmarks: SWE-Bench Verified (software development), Cybench (cybersecurity assessment), and RE-Bench…
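To make the two-step chain (Release Date → Elo → Benchmark) concrete, here is a minimal sketch of how such a forecast could be composed. It is not the authors' implementation. The functional forms (a linear fit for release date to Elo, a sigmoid for Elo to benchmark score), the helper names `fit_line` and `forecast`, and all data points are assumptions made up for illustration only.

```python
import numpy as np

# Synthetic, made-up observations for illustration only (NOT from the paper).
release_year = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
elo = np.array([1100.0, 1150.0, 1200.0, 1250.0, 1300.0])
score = np.array([0.10, 0.18, 0.30, 0.45, 0.60])  # benchmark success rate in (0, 1)

BASE_YEAR = 2023.0  # center the time axis to keep the fit well conditioned


def fit_line(x, y):
    """Least-squares straight line; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept


def logit(p):
    return np.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Step 1 (assumed linear): release date -> intermediate metric (Elo).
a1, b1 = fit_line(release_year - BASE_YEAR, elo)

# Step 2 (assumed sigmoid): Elo -> benchmark score.
# A sigmoid in Elo is a straight line in logit space, so fit the line there.
a2, b2 = fit_line(elo, logit(score))


def forecast(year):
    """Chain both steps: year -> predicted Elo -> predicted success rate."""
    predicted_elo = a1 * (year - BASE_YEAR) + b1
    return float(sigmoid(a2 * predicted_elo + b2))


if __name__ == "__main__":
    for year in (2025.0, 2025.5, 2026.0):
        print(year, round(forecast(year), 3))
```

A one-step variant would fit the benchmark score against release date directly and skip the intermediate Elo prediction. Backtesting, as described in the abstract, would amount to fitting on earlier models only and checking the forecast against held-out later models.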

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Natural Language Processing Techniques