Consistency Checks for Language Model Forecasters
Daniel Paleka, Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Vineeth Bhat, Adam Shen, Evan Wang, Florian Tramèr

TL;DR
This paper introduces a novel consistency-based evaluation framework for language model forecasters, providing instant performance metrics that correlate with future ground truth and enabling ongoing benchmarking.
Contribution
It proposes a new arbitrage-based consistency metric, an automated evaluation system, and a forecasting benchmark that offers real-time performance assessment of language model forecasters.
Findings
Consistency metrics correlate with Brier scores
Automated system effectively measures prediction consistency
Benchmark enables long-term forecasting evaluation
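For readers unfamiliar with the Brier score mentioned in the first finding, here is a minimal sketch of its standard definition for binary questions: the mean squared difference between the forecast probability and the resolved outcome (0 or 1), where lower is better. The function name `brier_score` and the sample inputs are illustrative assumptions and are not taken from the paper's code.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes (0 or 1)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical example: three resolved questions and one forecaster's probabilities.
example_score = brier_score([0.9, 0.2, 0.5], [1, 0, 1])
```

The score can only be computed once the questions have resolved, which is the delay that the consistency metrics are meant to work around.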
Abstract
Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance begs the question: how can we benchmark and evaluate these forecasters instantaneously? Following the consistency check framework, we measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We propose a new, general consistency metric based on arbitrage: for example, if a forecasting AI illogically predicts that both the Democratic and Republican parties have 60% probability of winning the 2024 US presidential election, an arbitrageur can trade against the forecaster's predictions and make a profit. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions,…
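The election example in the abstract can be made concrete with a small sketch. This is a simplified Dutch-book calculation and not the paper's actual arbitrage metric. It assumes the outcomes are mutually exclusive and exhaustive, and that the forecaster is willing to buy or sell a contract paying 1 unit on each outcome at its stated probability. The function name `guaranteed_profit` is hypothetical.

```python
def guaranteed_profit(probs):
    """Risk-free profit from trading one contract per outcome against `probs`.

    Assumption: outcomes are mutually exclusive and exhaustive, so exactly
    one contract pays out 1 unit regardless of what happens.
    """
    total = sum(probs)
    # total > 1: sell every contract, collect `total`, pay out exactly 1.
    # total < 1: buy every contract, pay `total`, receive exactly 1.
    # total == 1: the forecasts are coherent and no sure profit exists.
    return abs(total - 1.0)


# The abstract's example: both parties are given a 60% chance of winning.
inconsistent = guaranteed_profit([0.6, 0.6])
# A coherent forecaster offers no arbitrage.
coherent = guaranteed_profit([0.6, 0.4])
```

Under these assumptions, the 60%/60% forecaster hands the arbitrageur about 0.2 units per pair of contracts, while the coherent forecaster gives up nothing.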
Peer Reviews
Decision: ICLR 2025 Oral
- Strong communication of the problem and proposed resolution
- Detailed data generation for both already-resolved experiments and an open-ended future forecasting benchmark
- Extensive discussion of decisions & limitations
- Principled theory and evaluation.
- Correlation plots indicate that the consistency scoring criteria could be used to generally select better forecasters.
- Theoretical and practical method to improve consistency at evaluation time.
1. Test-time ArbitrageForecaster does not gain consistency in an emergent set of checks; only those it is specifically designed to address. This reduces the impact of the paper slightly.
2. New consistency metrics are rather easily Goodharted; one can imagine training a model on a vast set of synthetic forecasting problems respecting consistency checks in order to gain nearly perfect consistency scores with no improved knowledge of the future. This would limit the long-term efficacy of these met
* The manuscript is well-written, and the ideas are easy to follow.
* The paper effectively addresses the challenge of evaluating LLM forecasters by proposing new metrics based on arbitrage and hypothesis testing.
* It provides a thorough derivation and demonstrates a statistically significant correlation between the proposed metrics and the Brier score.
* The authors also curated a dataset of forecasting questions, enhancing evaluation for future work.
* The ArbitrageForecaster section would benefit from additional practical details, including parameter tuning, framework specifications, and methods for applying adjustments across various LLM models. Adding a framework visualization could further enhance clarity.
This paper explores the important subject of using LLMs to forecast future events. Clearly a lot of effort has gone into the work, from the data generation aspects to formalizing and running the evaluation pipeline. Another strength of the paper, at least from what I understand, is that the authors have tried to avoid data contamination issues that are prevalent in the literature involving the use of LLMs. They have also created a new benchmark for events resolving in 2028, which could be benefic
I’m a bit confused by the focus of the study itself, which is to use logical consistency to evaluate a forecaster before resolution of the events themselves; presumably, the point of this is to evaluate forecasts sooner. But what is the value of using consistency as an early marker? There are likely confounding factors at play here that determine a model's effect on both forecasting consistency and the performance on the task. For instance, maybe a certain class of models (perhaps larger or trai
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training · Balanced Selection
