Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Nick Merrill, Jaeho Lee, Ezra Karger

TL;DR
More capable language models produce worse distributional forecasts on time series with superlinear growth and tail risk of regime change. The failure concentrates in the upper tail of the forecast distribution, which conventional evaluation metrics miss.
Contribution
It releases ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark, and documents inverse scaling in LLM forecasting on FBSim, on synthetic SIR epidemics with a matched linear control, and on real-world datasets.
Findings
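More capable models shift their upper-tail forecasts upward to track aggressive extrapolations of growth, while the lower tail stays put, so the overall distributional forecast gets worse.
Inverse scaling is consistent across synthetic and real-world datasets, including COVID-19, measles, housing markets, and hyperinflation.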
Conventional metrics miss this inverse scaling; tail-inclusive measures reveal the effect.
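To make the idea of a per-quantile view concrete, here is a minimal sketch that scores each forecast quantile separately with the standard pinball (quantile) loss. This is an illustration only: the summary does not say which scoring rule the paper uses, and the function names, quantile levels, and all numbers below are hypothetical.

```python
# Sketch under stated assumptions: pinball loss is a standard scoring rule for
# quantile forecasts, but it is NOT confirmed to be the paper's metric.
# All outcomes and forecasts below are made-up illustrative values.

def pinball_loss(y, q_pred, tau):
    """Pinball loss of predicted quantile q_pred at level tau for outcome y."""
    diff = y - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1) * diff


def per_quantile_loss(outcomes, forecasts, levels):
    """Mean pinball loss per quantile level.

    forecasts[i][j] is the predicted quantile at levels[j] for outcomes[i].
    """
    n = len(outcomes)
    return {
        tau: sum(pinball_loss(y, f[j], tau) for y, f in zip(outcomes, forecasts)) / n
        for j, tau in enumerate(levels)
    }


levels = [0.05, 0.5, 0.95]
outcomes = [100.0, 120.0]

# Two hypothetical forecasters that agree on the lower quantiles and the median
# but differ in how far they push the upper tail.
modest = [[80.0, 100.0, 130.0], [90.0, 115.0, 150.0]]
aggressive = [[80.0, 100.0, 400.0], [90.0, 115.0, 500.0]]

modest_scores = per_quantile_loss(outcomes, modest, levels)
aggressive_scores = per_quantile_loss(outcomes, aggressive, levels)

for tau in levels:
    print(tau, modest_scores[tau], aggressive_scores[tau])
```

In this toy setup the two forecasters score identically at the 0.05 and 0.5 levels and differ only at 0.95, so a metric that looks at the median alone would not separate them.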
Abstract
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not…
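As a rough illustration of the synthetic setting the abstract mentions, the sketch below simulates a discrete-time SIR epidemic and builds a linear series with the same start and end values. The parameter values, the Euler-style update, and the endpoint-matching construction of the linear control are all assumptions made for illustration; the abstract does not describe how the paper generates its SIR series or its matched control.

```python
# Sketch under stated assumptions: a textbook discrete-time SIR model.
# beta, gamma, population size, and the "linear control" construction are
# hypothetical choices, not taken from the paper.

def simulate_sir(beta, gamma, population, initial_infected, steps):
    """Return the infected count at each step of a discrete-time SIR model."""
    s = population - initial_infected
    i = float(initial_infected)
    infected = [i]
    for _ in range(steps):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        infected.append(i)
    return infected


def linear_control(series):
    """Straight line sharing the first and last values of `series`."""
    start, end = series[0], series[-1]
    last_index = len(series) - 1
    return [start + (end - start) * t / last_index for t in range(len(series))]


# Early outbreak phase: nearly everyone is susceptible, so growth is
# close to exponential (superlinear).
epidemic = simulate_sir(beta=0.4, gamma=0.1, population=1_000_000,
                        initial_infected=10, steps=20)
control = linear_control(epidemic)

# Step-to-step increments grow for the epidemic but stay constant for the control.
epidemic_steps = [b - a for a, b in zip(epidemic, epidemic[1:])]
control_steps = [b - a for a, b in zip(control, control[1:])]
print(epidemic_steps[:3], epidemic_steps[-3:])
print(control_steps[:3])
```

Pairing a superlinear series with a linear one of the same overall scale is one way to check whether a forecasting failure is specific to the growth structure rather than to the magnitude of the values.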
