Automating Forecasting Question Generation and Resolution for AI Evaluation
Nikos I. Bosse, Peter Mühlbacher, Jack Wildman, Lawrence Phillips, Dan Schwarz

TL;DR
This paper introduces an automated system that uses LLM-powered web research agents to generate and resolve diverse forecasting questions at scale, supporting the development and evaluation of AI forecasters.
Contribution
The authors develop an LLM-powered system that automatically generates and resolves forecasting questions. They estimate that it produces verifiable, unambiguous questions at a higher rate than Metaculus, a leading human-curated forecasting platform.
Findings
The system produces verifiable, unambiguous questions approximately 96% of the time, exceeding the rate of Metaculus.
Questions are resolved with 95% accuracy.
Forecasting agents built on more advanced LLMs perform better, achieving improved Brier scores.
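For readers unfamiliar with the metric named in the findings, the Brier score for binary questions is the mean squared difference between the forecast probability and the outcome (1 if the event happened, 0 if not), so lower is better. The following is a minimal sketch of that standard definition; the function name and the sample numbers are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes.

    Illustrative helper (hypothetical name, not from the paper).
    0.0 is a perfect score; always forecasting 0.5 yields 0.25.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("at least one forecast is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Made-up example: two questions, one resolved YES (1) and one resolved NO (0).
confident = brier_score([0.9, 0.2], [1, 0])  # forecasts close to the outcomes
hedging = brier_score([0.5, 0.5], [1, 0])    # uninformative 50% forecasts
```

In this made-up example the confident, well-calibrated forecasts score lower (better) than the 50% forecasts, which is the sense in which a forecasting agent "improves" its Brier score.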
Abstract
Forecasting future events is highly valuable in decision-making and is a robust measure of general intelligence. As forecasting is probabilistic, developing and evaluating AI forecasters requires generating large numbers of diverse and difficult questions, and accurately resolving them. Previous efforts to automate this laborious work relied on recurring data sources (e.g., weather, stocks), limiting diversity and utility. In this work, we present a system for generating and resolving high-quality forecasting questions automatically and at scale using LLM-powered web research agents. We use this system to generate 1499 diverse, real-world forecasting questions, and to resolve them several months later. We estimate that our system produces verifiable, unambiguous questions approximately 96% of the time, exceeding the rate of Metaculus, a leading human-curated forecasting platform. We…
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling · Mobile Crowdsensing and Crowdsourcing
