Evaluating LLMs on Real-World Forecasting Against Expert Forecasters

Janna Lu

arXiv:2507.04562·cs.LG·August 6, 2025

TL;DR

This paper evaluates the forecasting accuracy of large language models against expert forecasters on real-world questions, revealing that while LLMs outperform the general crowd, they still lag behind top human experts.

Contribution

It provides a comprehensive comparison of state-of-the-art LLMs with expert forecasters on 464 real-world forecasting questions.

Findings

01

LLMs outperform the general crowd in forecasting accuracy.

02

LLMs still significantly underperform compared to top human experts.

03

Frontier models achieve better Brier scores than the human crowd.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggled to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top forecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of experts.
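For readers unfamiliar with the metric named in the abstract, the Brier score for binary questions is the mean squared difference between the forecast probability and the 0/1 outcome, so lower is better and a constant 0.5 forecast scores 0.25. The sketch below is a minimal illustration under that standard definition; the function name `brier_score` and all probabilities and outcomes are hypothetical, not data or code from the paper, and the paper's exact scoring setup may differ.

```python
from statistics import mean


def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary (0/1) outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("at least one forecast is required")
    return mean((p - o) ** 2 for p, o in zip(forecasts, outcomes))


# Hypothetical resolutions for four binary questions (1 = resolved yes).
outcomes = [1, 0, 1, 0]

# Hypothetical probabilities; these are illustrative, not results from the paper.
model_probs = [0.9, 0.2, 0.6, 0.1]
crowd_probs = [0.7, 0.4, 0.5, 0.3]

model_score = brier_score(model_probs, outcomes)
crowd_score = brier_score(crowd_probs, outcomes)
baseline_score = brier_score([0.5] * len(outcomes), outcomes)

print(f"model: {model_score:.4f}  crowd: {crowd_score:.4f}  always-0.5: {baseline_score:.4f}")
```

In this toy example the hypothetical model scores lower (better) than the hypothetical crowd, which is the kind of comparison the abstract describes between frontier models, the human crowd, and expert forecasters.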

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper includes a prospective and retrodictive evaluation and compares them against each other. This has been sorely missing in the literature, but oddly I don't think the authors of the paper realize how much value this provides to the field. E.g., it partly resolves some of the key concerns highlighted by Paleka et al. (see weaknesses section for this citation). This comparison should be greatly emphasized. The fact that retrodictive and prospective evaluations give consistent results wou

Weaknesses

- Line 28 says "There are two types of forecasting: predicting the future based on a few datapoints or heuristics, or making predictions with a traditional machine-learning model". This doesn't seem like a natural taxonomy. Where does time series forecasting fit in? Also, forecasting like that done on Metaculus isn't made with few datapoints; enormous amounts of data (albeit unstructured) go into those forecasts. To the best of my knowledge, the standard term for Metaculus-style forecasting is

Reviewer 02 · Rating 0 · Confidence 4

Strengths

1. The paper contributes to measurement of LLM abilities in real-world reasoning and forecasting. The contributed dataset could potentially be of great value.
2. It is a timely problem to study.
3. Experiments consider most of the frontier closed source models.

Weaknesses

1. The paper states various things without attribution/substantiation/citation. A couple of examples (among many) are as follows. Lines 84-88, about mental models, ensembles of LLMs being better, etc. Lines 105-106, statements like “An LLM does better when fed news articles from AskNews over Perplexity”.
2. I think the claim about preventing contamination as events haven’t occurred yet based on LLM cutoff date is not entirely accurate. This is because the experiments are comparing LLMs' performance wi

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The paper is overall well written with a clean main result. The measurements are standard and appear sound. The paper chooses questions resolved after each model’s knowledge cutoff and inputs are filtered to short, pre-resolution news summaries. This prevents potential leakage. The paper evaluates a wide range of frontier models, including o3-pro, DeepSeek-V3 and Claude-3.6-Sonnet.

Weaknesses

I find the paper overall quite weak in its data collection and evaluation methodology. On the dataset, it only collects about 400 questions from one platform (i.e., Metaculus) with most questions from Jul–Dec 2024. It's unclear to me if the dataset is diverse enough. Compared with prior work like https://openreview.net/forum?id=FlcdW7NPRY, 400 is also quite a small sample size. Regarding evaluation, the paper considers only 2 prompt styles and offers no attempt to optimize the overall pr
