TL;DR
PolyBench is a benchmark dataset and evaluation framework that tests how well large language models forecast and trade on live Polymarket prediction markets, combining news, order-book state, and financial performance metrics.
Contribution
It introduces PolyBench, a novel multimodal benchmark with real market data, and evaluates LLMs' forecasting and trading performance in a realistic setting.
Findings
Only two of the seven evaluated models achieved positive financial returns.
Models showed a gap between language fluency and probabilistic reasoning.
PolyBench provides a contamination-proof, financially-grounded evaluation standard.
Abstract
Predicting real-world events from live market signals demands systems that fuse qualitative news with quantitative order-book dynamics under strict temporal discipline, a challenge existing benchmarks fail to capture. We present PolyBench, a multimodal benchmark derived from Polymarket that records point-in-time cross-sections of 38,666 binary prediction markets spanning 4,997 events, synchronously coupling each snapshot with a Central Limit Order Book (CLOB) state and a real-time news stream. Using PolyBench, we evaluate seven state-of-the-art Large Language Models, spanning open- and closed-source families, generating 36,165 predictions under identical, timestamp-locked market states collected between February 6 and 12, 2026. Our multidimensional framework assesses directional accuracy, our proposed Confidence-Weighted Return (CWR), Annualized Percentage Yield (APY),…
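To make two of the metrics named in the abstract concrete, here is a minimal Python sketch of directional accuracy on binary markets and of annualizing a short-window return. It is an illustration under stated assumptions, not the paper's implementation: the `Prediction` record, the 0.5 decision threshold, and the compounding formula used for annualization are all assumptions introduced here, and the paper's own APY and CWR definitions may differ (CWR is not sketched because its definition is not given in the text above).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Prediction:
    """Hypothetical record: one model forecast for one binary market."""
    prob_yes: float      # model's probability that the market resolves YES
    resolved_yes: bool   # how the market actually resolved


def directional_accuracy(preds: Sequence[Prediction]) -> float:
    """Fraction of forecasts whose implied direction matches the outcome.

    Assumption: a forecast counts as "YES" when prob_yes > 0.5.
    """
    if not preds:
        raise ValueError("need at least one prediction")
    hits = sum((p.prob_yes > 0.5) == p.resolved_yes for p in preds)
    return hits / len(preds)


def annualize(period_return: float, days: float) -> float:
    """Compound a return earned over `days` days to a one-year figure.

    Assumption: standard compounding, (1 + r) ** (365 / days) - 1.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    return (1.0 + period_return) ** (365.0 / days) - 1.0


if __name__ == "__main__":
    sample = [
        Prediction(0.8, True),   # correct direction
        Prediction(0.3, True),   # wrong direction
        Prediction(0.4, False),  # correct direction
        Prediction(0.9, False),  # wrong direction
    ]
    print(directional_accuracy(sample))
    print(annualize(0.01, 7))
```

Annualizing a return measured over a window of only a few days amplifies small differences considerably, so figures produced this way are best read as a relative comparison between models.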
