OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking
Yiding Ma, Chengyun Ruan, Kaibo Huang, Zhongliang Yang, Linna Zhou

TL;DR
OracleProto is a reproducible framework that evaluates large language models' forecasting abilities by reconstructing past events into time-bound samples, enabling fair comparison and reducing information leakage.
Contribution
OracleProto introduces a reproducible benchmarking method that separates genuine forecasting from recall of facts learned during pretraining, with controlled information leakage and hierarchical scoring.
Findings
Distinguishes forecasting quality, stability, and efficiency across models.
Reduces residual information leakage to below 1%.
Provides a reusable, auditable dataset for model evaluation.
Abstract
Large language models are moving from static text generators toward real-world decision-support systems, where forecasting is a composite capability that links information gathering, evidence integration, situational judgment, and action-oriented decision making. This capability is in broad demand across finance, policy, industry, and scientific research, yet its evaluation remains difficult: live benchmarks evaluate forecasts before answers exist, making them the cleanest way to measure forecasting ability, but they expire once events resolve; retrospective benchmarks are reproducible, but they cannot reliably distinguish genuine forecasting from facts a model may have already learned during pretraining. Prompting models to "pretend not to know" cannot replace a genuine knowledge boundary. We propose OracleProto, a reproducible framework for evaluating LLM native forecasting…
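The title's two ideas, knowledge cutoff and temporal masking, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Question` and `Evidence` types, their field names, and both rules (keep a question only if it resolved after the model's cutoff; show only evidence published on or before the simulated forecast date) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical record types; the field names are illustrative assumptions.
@dataclass(frozen=True)
class Question:
    text: str
    asked_on: date      # simulated date on which the forecast is made
    resolved_on: date   # date on which the outcome became known


@dataclass(frozen=True)
class Evidence:
    text: str
    published_on: date


def is_eligible(question: Question, model_cutoff: date) -> bool:
    """Assumed rule: keep a question only if it resolved strictly after
    the model's knowledge cutoff, so the answer cannot be in pretraining data."""
    return question.resolved_on > model_cutoff


def mask_evidence(question: Question, evidence: list[Evidence]) -> list[Evidence]:
    """Assumed rule: hide every document published after the forecast date."""
    return [e for e in evidence if e.published_on <= question.asked_on]
```

Under these assumptions, a question resolved before the cutoff is dropped rather than paired with an instruction to "pretend not to know", which matches the abstract's point that prompting cannot replace a genuine knowledge boundary.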
