LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena

Qingchuan Yang; Simon Mahns; Sida Li; Anri Gu; Jibang Wu; Haifeng Xu

arXiv:2510.17638·cs.AI·December 23, 2025

Open Access 3 Reviews

TL;DR

This paper explores the potential of large language models to serve as predictive tools for real-world events, introducing a new benchmark and revealing both strengths and limitations in their forecasting abilities.

Contribution

It introduces Prophet Arena, a comprehensive benchmark for evaluating LLMs' forecasting capabilities across diverse tasks and stages.

Findings

01

Many LLMs already show impressive forecasting capability, reflected in small calibration errors and consistent prediction confidence.

02

Key limitations include event recall errors and data misunderstanding.

03

Market-based information aggregation outperforms LLMs near event resolution.

Abstract

Forecasting is not only a fundamental intellectual pursuit but also of significant importance to societal systems such as finance and economics. The rapid advances of large language models (LLMs) trained on Internet-scale data raise the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market…
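The calibration error mentioned in the abstract can be illustrated with a small sketch. This page does not give the paper's exact metric definitions, so the code below assumes two standard choices for binary events: the Brier score and a binned expected calibration error (ECE). The function names, the bin count, and the sample forecasts are hypothetical and are not taken from Prophet Arena.

```python
from typing import List, Sequence, Tuple


def _check(probs: Sequence[float], outcomes: Sequence[int]) -> None:
    """Validate inputs: equal non-zero length, probabilities in [0, 1]."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between forecast probability and the 0/1 outcome."""
    _check(probs, outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE (assumed definition): size-weighted average of
    |mean forecast - observed frequency| over equal-width probability bins."""
    _check(probs, outcomes)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        # p == 1.0 would index past the end, so clamp it into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - freq)
    return ece


if __name__ == "__main__":
    # Hypothetical forecasts and resolved outcomes, for illustration only.
    forecasts = [0.9, 0.8, 0.3, 0.1]
    resolved = [1, 1, 0, 0]
    print("Brier:", brier_score(forecasts, resolved))
    print("ECE:", expected_calibration_error(forecasts, resolved))
```

Lower values are better for both quantities. The Brier score rewards forecasts that are both accurate and confident, while ECE only measures whether stated probabilities match observed frequencies.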

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

1. The paper introduces a useful live benchmark for measuring LLM capabilities via forecasting open Kalshi events.
2. I like the relative advantage based metrics used for comparing language models. This mitigates issues in existing benchmarks where questions can vary in difficulty (sometimes not even being "forecasting" questions as future information is leaked).
3. The paper has detailed analysis of LLM forecasting behaviour and interesting insights across diverse ablations in Section 4 and t

Weaknesses

1. The paper does not show awareness of existing literature in LLM forecasting. For example, probabilistic forecasting is mentioned as the first "distinguishing feature" of the benchmark in the introduction. However, this has been the standard used in existing papers in LLM forecasting [1]. I am also concerned about the supposed "introduction" of "LLM-as-a-prophet paradigm". LLM forecasting has been an active area of study for the last 3 years [2]. I do not see the value of adding a new term, es

Reviewer 02·Rating 6·Confidence 3

Strengths

1. The manuscript presents a novel paradigm: a real-world, contamination-free forecasting benchmark.
2. The manuscript also presents comprehensive metrics that blend accuracy and calibration well.
3. The open-sourced subset supports reproducibility.

Weaknesses

1. Event recall errors and approximate temporal memory are a potential weakness.
2. Probability estimates are conservative relative to markets.
3. Dependence on search/source quality; not all domains benefit equally.
4. Limited profitability (returns < 1).
5. Incomplete foresight near event resolution; calibration is still imperfect.

Reviewer 03·Rating 6·Confidence 3

Strengths

- This paper proposes a live and realistic benchmark that directly measures LLMs’ forecasting ability on future real-world events, highly relevant to industrial applications.
- Strong empirical design including probabilistic scoring, calibration, and market-based evaluation, providing a multi-angle understanding of model forecasting capability.
- Clear analysis connecting reasoning quality and belief update dynamics.
- Well-articulated motivation and clean methodology presentation.

Weaknesses

- No measurement of alignment between human judges and model judges, raising concerns about rating validity.
- Reliance on a single web-search agent without explicit evidence quality assessment may confound reasoning evaluation.

Taxonomy

Topics: Stock Market Forecasting Methods · Forecasting Techniques and Applications · Explainable Artificial Intelligence (XAI)