Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents
Maksym Nechepurenko, Pavel Shuvalov

TL;DR
Foresight Arena introduces a permissionless on-chain benchmark using real-world prediction markets to evaluate AI forecasting agents with proper scoring rules, ensuring honest reporting and measuring predictive edge.
Contribution
It is the first on-chain, permissionless benchmark for AI forecasting agents that uses trustless market outcomes and novel scoring rules to accurately assess predictive performance.
Findings
Analytical variance formulas for the Alpha Score are derived.
Power analysis shows that 350 predictions are needed to detect a 2% edge with 80% power.
Murphy decomposition effectively distinguishes well-calibrated agents from market trackers.
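For readers unfamiliar with the Murphy decomposition mentioned in the findings, the following is a minimal Python sketch of the standard identity for binary outcomes, Brier = reliability - resolution + uncertainty. It is not the paper's implementation: the function names are hypothetical, and grouping forecasts by their exact value (rather than by probability bins) is an assumption made here so that the identity holds exactly.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    n = len(forecasts)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are grouped by exact value (an assumption of this sketch),
    which makes brier == reliability - resolution + uncertainty exact.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Collect the observed outcomes for each distinct forecast value.
    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        freq = sum(obs) / len(obs)  # observed frequency in this group
        reliability += len(obs) * (p - freq) ** 2
        resolution += len(obs) * (freq - base_rate) ** 2

    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Toy data (made up for illustration): a perfectly calibrated forecaster.
forecasts = [0.8] * 5 + [0.2] * 5
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

rel, res, unc = murphy_decomposition(forecasts, outcomes)
bs = brier_score(forecasts, outcomes)
```

On this toy data the reliability term is zero because each forecast value matches its observed frequency, so the whole Brier score is explained by uncertainty minus resolution. A lower reliability term indicates better calibration; a higher resolution term indicates forecasts that separate outcomes more sharply from the base rate.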
Abstract
Evaluating the true forecasting ability of AI agents requires environments that are resistant to overfitting, free from centralized trust, and grounded in incentive-compatible scoring. Existing benchmarks either rely on static datasets vulnerable to training-data contamination, or measure trading PnL -- a metric conflating predictive accuracy with timing, sizing, and risk appetite. We introduce Foresight Arena, the first permissionless, on-chain benchmark for evaluating AI forecasting agents on real-world prediction markets. Agents submit probabilistic forecasts on binary Polymarket markets via a commit-reveal protocol enforced by Solidity smart contracts on Polygon PoS; outcomes are resolved trustlessly through the Gnosis Conditional Token Framework. Performance is measured by the Brier Score and a novel Alpha Score -- proper scoring rules that incentivize…
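The commit-reveal pattern named in the abstract can be illustrated off-chain with a short Python sketch. Everything specific here is an assumption rather than the paper's protocol: the function names, the encoding of a forecast as basis points in two bytes, and the 32-byte salt are hypothetical, and SHA-256 stands in for whatever hash the actual Solidity contracts use.

```python
import hashlib
import hmac
import secrets


def make_commitment(forecast_bps: int, salt: bytes) -> bytes:
    """Hash a forecast (in basis points, 0..10000) together with a secret salt.

    SHA-256 and this byte layout are stand-ins chosen for the sketch,
    not the encoding used by the on-chain contracts.
    """
    if not 0 <= forecast_bps <= 10_000:
        raise ValueError("forecast must be between 0 and 10000 basis points")
    payload = forecast_bps.to_bytes(2, "big") + salt
    return hashlib.sha256(payload).digest()


def verify_reveal(commitment: bytes, forecast_bps: int, salt: bytes) -> bool:
    """Check that a revealed (forecast, salt) pair matches the earlier commitment."""
    expected = make_commitment(forecast_bps, salt)
    return hmac.compare_digest(commitment, expected)


# Commit phase: the agent publishes only the hash, keeping forecast and salt private.
salt = secrets.token_bytes(32)
commitment = make_commitment(7300, salt)  # 7300 bps = a 73% forecast

# Reveal phase: the agent discloses forecast and salt; anyone can check the hash.
honest_reveal_ok = verify_reveal(commitment, 7300, salt)
altered_reveal_ok = verify_reveal(commitment, 7400, salt)
```

The point of the two phases is that a forecast is fixed before others can see it, and cannot be changed afterwards without the hash check failing. The random salt prevents observers from recovering the forecast by hashing every possible value.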
