Pitfalls in Evaluating Language Model Forecasters

Daniel Paleka; Shashwat Goel; Jonas Geiping; Florian Tramèr

arXiv:2506.00723·cs.LG·June 3, 2025

Open Access·3 Reviews

TL;DR

This paper highlights the challenges and pitfalls in evaluating large language model forecasters, emphasizing the need for more rigorous methods to accurately assess their real-world forecasting capabilities.

Contribution

It identifies key evaluation issues such as temporal leakage and extrapolation difficulties, providing systematic analysis and concrete examples to improve future assessment practices.

Findings

01 Evaluation results can be unreliable due to temporal leakage

02 Current methods may not accurately reflect real-world forecasting performance

03 Rigorous evaluation methodologies are necessary for trustworthy assessments

Abstract

Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such conclusions as evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.
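To make the first category concrete, here is a minimal sketch of two date checks a backtest might apply. It is an illustration under stated assumptions, not the authors' method: the `Question` and `Document` types, their field names, and both helper functions are hypothetical. Passing these checks rules out only the most direct leakage; the abstract notes that temporal leakage takes many forms.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical record types for illustration; not taken from the paper.
@dataclass(frozen=True)
class Question:
    text: str
    opened: date    # date the forecast is (notionally) made
    resolved: date  # date the outcome became known


@dataclass(frozen=True)
class Document:
    title: str
    published: date


def is_backtest_safe(question: Question, model_cutoff: date) -> bool:
    """Return True only if the model's training cutoff precedes the forecast date.

    A necessary condition, not a sufficient one: other leakage routes remain.
    """
    return model_cutoff < question.opened


def filter_retrieved(question: Question, docs: list[Document]) -> list[Document]:
    """Keep only documents published on or before the forecast date."""
    return [d for d in docs if d.published <= question.opened]


if __name__ == "__main__":
    q = Question("Will X happen by mid-2024?", date(2024, 1, 1), date(2024, 6, 1))
    docs = [
        Document("Preview of X", date(2023, 12, 31)),
        Document("X has happened", date(2024, 3, 1)),  # published after the forecast date
    ]
    print(is_backtest_safe(q, model_cutoff=date(2023, 12, 1)))
    print([d.title for d in filter_retrieved(q, docs)])
```

The sketch trusts the `published` field as given; if that metadata is wrong or the page was edited later, the filter silently passes leaked content.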

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

S1: I think this paper is timely and quite relevant given the surge of papers in forecasting. This is an essential paper for the community. The primary focus of the paper is to bring forth the flaws and issues in evaluation and benchmarks.

S2: The paper demonstrates various issues with prior work's evaluations, such as model cut-off date, bias in retrieval, and leakage. The paper reads well and is structured for ease of understanding.

Weaknesses

W1: My main concern is that this is a meta-analysis paper whose main contribution is analyzing and synthesizing prior work through the lens of evaluation. It is not a new artifact in the traditional sense, such as an algorithm, dataset, or method.

W2: Some claims are supported through prompts or evidence, while others, such as LLMs gaming the benchmarks, are extrapolated or opinion-based.

Reviewer 02·Rating 4·Confidence 2

Strengths

1. The paper explores an interesting topic on the evaluation of LLM forecasters. The problem is well scoped and the structure is clear. It groups the challenges into two main categories and then discusses more specific cases under each.

2. The challenges identified by the authors are valid and important, and experiments in the subsequent work [1] further confirm them.

[1] FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

Weaknesses

1. Overall, I consider this paper closer to a position paper: its core contribution lies in raising the problem, summarizing the challenges, and offering methodological recommendations, while lacking new methods, benchmarks, or rigorous, controlled experimental evaluation to substantiate the main claims. 2. The paper focuses primarily on backtesting or retrodiction, and the authors contend that “The gold standard for evaluating a forecaster involves running it on unresolved questions, waiting u

Reviewer 03·Rating 8·Confidence 5

Strengths

I find the paper very well written and the arguments are presented cleanly and convincingly. In particular, the main argument is thoughtful. It systematizes failure modes for evaluating LLM forecasters into two categories—temporal leakage in backtests and misinterpretation of benchmark gains—which is a novel perspective in the literature. For each of the pitfalls, the paper provides concrete examples. For instance, on logical leakage, the authors show that the very fact that a question is being

Weaknesses

While I find the arguments convincing, the paper could be stronger in offering more systematic and empirical studies. For example, regarding logical leakage, I wonder if the impact of that could be precisely measured (in the context of any of the previous benchmarks). Similarly, for retrieval leakage, the paper offers a few examples. However, a broader measurement could strengthen the argument. For each pitfall, the paper suggests various potential solutions. It would be nice to implement & te

Taxonomy

Topics: Natural Language Processing Techniques