Scaling Open-Ended Reasoning to Predict the Future

Nikhil Chandak; Shashwat Goel; Ameya Prabhu; Moritz Hardt; Jonas Geiping

arXiv:2512.25070 · cs.LG · January 6, 2026

Open Access · 3 Models · 1 Dataset · 3 Reviews

TL;DR

This paper introduces OpenForesight, a scalable approach to train language models for open-ended future predictions using synthesized news-based data, improving forecasting accuracy and calibration.

Contribution

The work presents a novel automated data synthesis method and a specialized forecasting model, and demonstrates improved calibration and accuracy over larger models.

Findings

1. OpenForecaster 8B matches larger proprietary models in forecasting tasks.
2. Forecasting training improves model calibration and consistency.
3. Open-source release of models, code, and data facilitates further research.
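
Calibration, as mentioned in the findings above, is commonly measured with a proper scoring rule such as the Brier score. The following is a minimal sketch, assuming each forecast reduces to a stated probability plus a 0/1 judgment of whether the predicted answer was correct. The function names and sample values are hypothetical and are not taken from the paper's code release.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated probability and the 0/1 outcome.

    Lower is better; 0.0 means every forecast was fully confident and right.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def accuracy(outcomes: Sequence[int]) -> float:
    """Fraction of forecasts whose answer was judged correct."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # Hypothetical forecasts: stated confidence and whether the answer was right.
    probs = [0.9, 0.6, 0.2, 0.8]
    outcomes = [1, 1, 0, 0]
    print("Brier score:", brier_score(probs, outcomes))
    print("Accuracy:", accuracy(outcomes))
```

A model can have high accuracy and still score poorly here if it is overconfident on the questions it gets wrong, which is why the two numbers are usually reported together.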

Abstract

High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with…
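
The leakage safeguard described in the abstract can be illustrated with a date-cutoff filter over an offline corpus. This is only a sketch under stated assumptions: the `Article` type, the `retrieve_before` function, and the keyword-overlap ranking are hypothetical stand-ins, and the retriever actually used in the paper is not described in this listing.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Article:
    """Hypothetical record for one item in an offline news corpus."""
    title: str
    published: date
    text: str


def retrieve_before(corpus: List[Article], query: str, cutoff: date, k: int = 3) -> List[Article]:
    """Return up to k articles published strictly before `cutoff`.

    Ranking is a naive keyword overlap, used here only as a placeholder
    for a real retriever. The date filter is the part that prevents
    future information from reaching the forecaster.
    """
    terms = set(query.lower().split())
    eligible = [a for a in corpus if a.published < cutoff]

    def overlap(article: Article) -> int:
        return len(terms & set(article.text.lower().split()))

    return sorted(eligible, key=overlap, reverse=True)[:k]


if __name__ == "__main__":
    corpus = [
        Article("a", date(2025, 4, 1), "election results in country"),
        Article("b", date(2025, 6, 1), "election winner announced"),
        Article("c", date(2025, 3, 1), "weather report"),
    ]
    # Article "b" is published after the cutoff and must never be returned.
    for hit in retrieve_before(corpus, "election winner", date(2025, 5, 1)):
        print(hit.published, hit.title)
```

Filtering before ranking, rather than after, keeps the guarantee independent of how the ranking step is implemented.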

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. OpenForesight generation is a scalable, automated pipeline for generating open-ended forecasting questions.
2. The ablation studies provide a thorough validation of design choices like filtering, reward, and retrieval.

Weaknesses

1. The data and models may be heavily skewed toward topics that news media cover. They might perform better on domains such as politics than on long-term cultural shifts or scientific breakthroughs.
2. The filtering pipeline removes ~90% of generated questions. This makes the generation process very expensive and calls for further analysis of why the generation model fails so frequently. Would in-context learning help?
3. The paper doesn't justify why numeric answers are filtered out beyond avoi

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper tackles a valuable and difficult problem: scalable, open-ended forecasting. It presents a complete, end-to-end system, meticulously validating each component choice, from data curation to the final RL reward. The data generation and curation pipeline is fully automated and scalable, and it is thoughtfully designed to avoid common pitfalls like data leakage (by using an offline corpus) and self-preference bias (by using different models for generation and filtering). The ablation studie

Weaknesses

The paper's novelty is primarily in the systematic combination of existing techniques, which makes it hard to pinpoint the exact contribution. The experiments on specialized models are confined to a single model family, Qwen3. This limited scope makes it unclear how dependent the results are on the base Qwen3 model architecture and whether the pipeline and 'Accuracy + Brier' reward would be equally effective if applied to other popular open-weight models. How dependent are the results on the base

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Practical problem: Open-ended prediction with probability output is of clear real-world value.
2. Scalable data synthesis: From hundreds of thousands of articles to ~60k high-precision samples via multi-stage filtering (validity → best-candidate selection → leakage cleaning → answer-type control), maintaining quality and breadth.
3. Training recipe that balances goals: SFT warm-start expands solution diversity/ceiling; the joint reward avoids the “accuracy-only hurts calibration / calibration-only s

Weaknesses

1. Compute/cost transparency is limited: missing token counts, steps, GPU-hours for SFT/RL, and end-to-end latency (including retrieval).
2. No human/market baselines: comparisons to aggregated human forecasters or prediction-market probabilities are absent.
3. Source/language bias risk: test set from five English outlets; please assess topic/region balance and consider multilingual evaluation.
4. Long-horizon behavior unclear: what happens for resolutions ≥6/12 months?
5. Safety: unclear whether sensitive
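
The reviews above refer to an 'Accuracy + Brier' reward and to the trade-off between accuracy-only and calibration-only objectives. The exact formula is not given in this listing, so the sketch below shows only one plausible way to combine the two terms, under the assumption that each rollout yields a stated probability and a binary correctness judgment. The function name and the equal weighting are hypothetical.

```python
def accuracy_plus_brier_reward(prob: float, correct: bool) -> float:
    """One plausible joint reward (an assumption, not the paper's confirmed formula).

    accuracy term: 1.0 if the answer was judged correct, else 0.0
    Brier term:    negative squared gap between stated probability and outcome
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    brier_term = -((prob - outcome) ** 2)
    return outcome + brier_term


if __name__ == "__main__":
    # Confident and right is rewarded most; confident and wrong is penalized most.
    for prob, correct in [(1.0, True), (0.5, True), (0.5, False), (1.0, False)]:
        print(prob, correct, accuracy_plus_brier_reward(prob, correct))
```

Under this form, a wrong answer stated with low confidence loses little, while a correct answer stated with low confidence forgoes part of the reward, so the model is pushed toward both correctness and honest probabilities.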

Code & Models

Models

Datasets

nikhilchandak/OpenForesight
dataset · 522 dl

Taxonomy

Topics: Forecasting Techniques and Applications · Topic Modeling · Misinformation and Its Impacts