Pitfalls in Evaluating Language Model Forecasters
Daniel Paleka, Shashwat Goel, Jonas Geiping, Florian Tramèr

TL;DR
This paper highlights the challenges and pitfalls in evaluating large language model forecasters, emphasizing the need for more rigorous methods to accurately assess their real-world forecasting capabilities.
Contribution
It identifies key evaluation issues such as temporal leakage and extrapolation difficulties, providing systematic analysis and concrete examples to improve future assessment practices.
Findings
Evaluation results can be unreliable due to temporal leakage
Current methods may not accurately reflect real-world forecasting performance
Rigorous evaluation methodologies are necessary for trustworthy assessments
Abstract
Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such conclusions as evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.
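As a rough illustration of the most basic guard against temporal leakage in a backtest, the sketch below checks that a question opened after the model's training cutoff and drops retrieved documents dated on or after the question's open date. This is not the authors' method; the `Question` and `Document` types, their fields, and both helper functions are hypothetical names introduced here. A date filter like this is at best a necessary check: it assumes the reported publication dates are trustworthy and does nothing about the other forms of leakage the abstract alludes to.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    title: str
    published: date  # hypothetical field; real retrieval metadata may be wrong


@dataclass(frozen=True)
class Question:
    text: str
    opened: date    # date on which the forecast is supposed to be made
    resolved: date  # date on which the outcome became known


def is_after_cutoff(question: Question, model_cutoff: date) -> bool:
    """Return True if the question opened strictly after the model's cutoff."""
    return question.opened > model_cutoff


def filter_documents(question: Question, docs: list[Document]) -> list[Document]:
    """Keep only documents published strictly before the question opened."""
    return [d for d in docs if d.published < question.opened]
```

For example, with a question opened on 2024-06-01, a document dated 2024-05-01 would be kept, while documents dated 2024-06-01 or later would be dropped.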
Peer Reviews
Decision: ICLR 2026 Poster
S1: I think this paper is timely and quite relevant given the surge of papers in forecasting. This is an essential paper for the community. The primary focus of the paper is to bring forth the flaws and issues in evaluation and benchmarks. S2: The paper demonstrates various issues with prior work’s evaluations, such as model cut-off dates, bias in retrieval, and leakage. The paper reads well and is structured for ease of understanding.
W1. My main concern is that this is a meta-analysis paper, where the main contribution is the analysis and synthesis of prior work through the lens of evaluation. It is not a new artifact in the traditional sense, such as an algorithm, dataset, or method. W2. Some claims are supported through prompts/evidence, while others, such as LLMs gaming the benchmarks, are extrapolated or opinion-based.
1. The paper explores an interesting topic on the evaluation of LLM forecasters. The problem is well scoped and the structure is clear. It groups the challenges into two main categories and then discusses more specific cases under each. 2. The challenges identified by the authors are valid and important, and experiments in the subsequent work [1] further confirm them. [1] FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
1. Overall, I consider this paper closer to a position paper: its core contribution lies in raising the problem, summarizing the challenges, and offering methodological recommendations, while lacking new methods, benchmarks, or rigorous, controlled experimental evaluation to substantiate the main claims. 2. The paper focuses primarily on backtesting or retrodiction, and the authors contend that “The gold standard for evaluating a forecaster involves running it on unresolved questions, waiting u
I find the paper very well written and the arguments are presented cleanly and convincingly. In particular, the main argument is thoughtful. It systematizes failure modes for evaluating LLM forecasters into two categories—temporal leakage in backtests and misinterpretation of benchmark gains—which is a novel perspective in the literature. For each of the pitfalls, the paper provides concrete examples. For instance, on logical leakage, the authors show that the very fact that a question is being
While I find the arguments convincing, the paper could be stronger in offering more systematic and empirical studies. For example, regarding logical leakage, I wonder if the impact of that could be precisely measured (in the context of any of the previous benchmarks). Similarly, for retrieval leakage, the paper offers a few examples. However, a broader measurement could strengthen the argument. For each pitfall, the paper suggests various potential solutions. It would be nice to implement & te
Taxonomy
Topics: Natural Language Processing Techniques
