TimeSeek: Temporal Reliability of Agentic Forecasters
Hamza Mostafa, Om Shastri, and Dennis Lee

TL;DR
TimeSeek introduces a benchmark for assessing how the reliability of agentic language model forecasters varies throughout a prediction market’s lifecycle, highlighting the importance of time-aware evaluation and retrieval strategies.
Contribution
The paper presents a new benchmark and comprehensive analysis of LLM forecasters' temporal reliability across market stages, with insights into retrieval effects and ensemble approaches.
Findings
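Models are most competitive with the market early in a market's life and on high-uncertainty markets, and much less competitive near resolution and on strong-consensus markets.
Web search improves pooled Brier Skill Score (BSS) for every model, but hurts in 12% of model-checkpoint pairs.
Simple two-model ensembles reduce error but do not surpass the market overall.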
Abstract
We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market's lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with and without web search, for 15,000 forecasts total. Models are most competitive early in a market's life and on high-uncertainty markets, but much less competitive near resolution and on strong-consensus markets. Web search improves pooled Brier Skill Score (BSS) for every model overall, yet hurts in 12% of model-checkpoint pairs, indicating that retrieval is helpful on average but not uniformly so. Simple two-model ensembles reduce error without surpassing the market overall. These descriptive results motivate time-aware evaluation and selective-deference policies rather than a single market snapshot or a uniform tool-use setting.
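The abstract reports results as Brier Skill Score and mentions simple two-model ensembles. The sketch below shows one common way such quantities are computed. It is an illustration under assumptions, not the paper's implementation: it assumes the market price is the BSS reference forecast and that an ensemble is an unweighted mean of two models' probabilities, and the function names and toy numbers are hypothetical.

```python
from statistics import fmean


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return fmean((p - y) ** 2 for p, y in zip(probs, outcomes))


def brier_skill_score(model_probs, reference_probs, outcomes):
    """BSS = 1 - BS_model / BS_reference.

    Positive means the model beats the reference; negative means it is worse.
    Assumption: the reference is the market-implied probability.
    """
    bs_ref = brier_score(reference_probs, outcomes)
    if bs_ref == 0:
        raise ValueError("reference Brier score is zero; BSS is undefined")
    return 1.0 - brier_score(model_probs, outcomes) / bs_ref


def mean_ensemble(probs_a, probs_b):
    """Unweighted average of two models' probabilities (assumed ensemble rule)."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


# Toy data, invented purely for illustration (not from the paper).
outcomes = [1, 0, 1, 0]
market = [0.8, 0.2, 0.6, 0.4]
model_a = [0.9, 0.4, 0.5, 0.3]
model_b = [0.6, 0.1, 0.8, 0.5]
ensemble = mean_ensemble(model_a, model_b)

for name, probs in [("model_a", model_a), ("model_b", model_b), ("ensemble", ensemble)]:
    print(name, round(brier_skill_score(probs, market, outcomes), 4))
```

In this toy example the ensemble has a lower Brier score than either model alone yet still has a negative BSS against the market, which is the qualitative pattern the abstract describes for ensembles.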
