Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Nick Merrill; Jaeho Lee; Ezra Karger

arXiv:2605.22672·cs.AI·May 22, 2026

TL;DR

More capable language models produce worse distributional forecasts on time series with superlinear growth and tail risk of regime change. The failure concentrates in the upper tail of the forecast distribution, and conventional evaluation metrics miss it.

Contribution

The paper releases ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark, and documents inverse scaling in LLM forecasting on FBSim, on synthetic SIR epidemics with a matched linear control, and on real-world datasets.

Findings

01

More capable models shift the upper tail of their forecasts upward to track aggressive extrapolations of growth, while the lower tail stays put; the result is worse distributional forecasts.

02

Inverse scaling appears on both synthetic and real-world data, including COVID-19, measles, housing markets, and hyperinflation.

03

Conventional metrics miss this inverse scaling; tail-inclusive measures reveal the effect.
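
To make the third finding concrete, here is a minimal sketch of a per-quantile score breakdown. It assumes pinball (quantile) loss as the scoring rule; the listing above does not say which metric the paper uses, so this is an illustration of the general idea rather than the authors' method. The function names, forecaster labels, and numbers are all hypothetical.

```python
import numpy as np

def pinball_loss(y_true, q_pred, tau):
    """Mean pinball (quantile) loss for forecasts of the tau-quantile."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(q_pred, dtype=float)
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

def per_quantile_losses(y_true, quantile_forecasts):
    """Report one loss per quantile level instead of a single average."""
    return {tau: pinball_loss(y_true, q, tau)
            for tau, q in quantile_forecasts.items()}

# Hypothetical outcomes and two hypothetical forecasters. They agree on the
# lower tail and the median; forecaster B pushes only the upper tail upward.
y_true = [100, 120, 150]
forecaster_a = {0.1: [80, 100, 120], 0.5: [100, 120, 150], 0.9: [130, 150, 190]}
forecaster_b = {0.1: [80, 100, 120], 0.5: [100, 120, 150], 0.9: [300, 400, 600]}

losses_a = per_quantile_losses(y_true, forecaster_a)
losses_b = per_quantile_losses(y_true, forecaster_b)
```

In this toy setup a metric based only on the median forecast scores the two forecasters identically, while the 0.9-quantile loss separates them. That is the kind of gap a tail-inclusive measure is meant to expose.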

Abstract

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not…
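
The contrast between a superlinear series and a linear control can be sketched as follows. This is an illustration under stated assumptions, not the paper's generator: the discrete-time SIR update is the textbook form, the parameter values are hypothetical, and "matched" is taken here to mean a straight line sharing the epidemic's first and last values, which may differ from the matching procedure the authors use.

```python
def simulate_sir(beta, gamma, population, initial_infected, steps):
    """Discrete-time SIR model; returns cumulative infections at each step."""
    s = float(population - initial_infected)  # susceptible
    i = float(initial_infected)               # currently infected
    cumulative = [i]
    for _ in range(steps):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        cumulative.append(cumulative[-1] + new_infections)
    return cumulative

def linear_control(series):
    """Straight line with the same first and last values as `series` (assumed matching rule)."""
    n = len(series) - 1
    start, end = series[0], series[-1]
    return [start + (end - start) * t / n for t in range(n + 1)]

# Hypothetical parameters chosen so the epidemic is still in its growth phase.
epidemic = simulate_sir(beta=0.4, gamma=0.1, population=1_000_000,
                        initial_infected=10, steps=30)
control = linear_control(epidemic)
```

During the growth phase each step of the epidemic series adds more than the previous one, so the curve sits below the linear control at every interior point even though both share the same endpoints. A forecaster extrapolating the recent growth rate would overshoot the control series but could look plausible on the epidemic.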
