QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

Jeremy Qin; Maksym Andriushchenko

arXiv:2604.15859·cs.LG·April 20, 2026

TL;DR

This paper introduces QuantSightBench, a new benchmark for evaluating large language models' ability to generate accurate and calibrated prediction intervals for numerical forecasting across various domains.

Contribution

It proposes prediction intervals as a more effective evaluation format for forecasting and assesses multiple models, revealing systematic overconfidence and coverage shortcomings.

Findings

01

None of the 11 models achieved 90% coverage.

02

The top models reached roughly 75-79% coverage, still short of the 90% target.

03

Calibration worsens at extreme magnitudes, indicating overconfidence.
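
The coverage figures above can be read as the fraction of questions whose realized value falls inside the model's stated interval. The following is a minimal sketch of that computation, not the paper's evaluation code: the function name `empirical_coverage` and the sample numbers are hypothetical, and the closed-interval convention (endpoints count as hits) is an assumption.

```python
from typing import Sequence, Tuple


def empirical_coverage(
    intervals: Sequence[Tuple[float, float]],
    outcomes: Sequence[float],
) -> float:
    """Return the fraction of outcomes lying inside their [lower, upper] interval.

    Assumption: intervals are closed, so an outcome equal to an endpoint counts as covered.
    """
    if len(intervals) != len(outcomes):
        raise ValueError("intervals and outcomes must have the same length")
    if not intervals:
        raise ValueError("at least one forecast is required")
    hits = sum(1 for (lower, upper), y in zip(intervals, outcomes) if lower <= y <= upper)
    return hits / len(intervals)


# Hypothetical forecasts, each stated at a nominal 90% confidence level.
intervals = [(90.0, 110.0), (0.5, 1.5), (1000.0, 3000.0), (10.0, 12.0)]
outcomes = [105.0, 2.0, 2500.0, 11.0]

coverage = empirical_coverage(intervals, outcomes)
print(f"empirical coverage: {coverage:.2f} (nominal target: 0.90)")
```

A forecaster whose empirical coverage sits well below the nominal level, as in this toy example, is producing intervals that are too narrow, which is the overconfidence pattern the findings describe.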

Abstract

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability,…
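
One plausible reading of "internal consistency across confidence levels" is that an interval stated at a higher confidence level should contain every interval stated at a lower level for the same quantity. The sketch below checks that nesting property. It is an illustration under that assumption, not the benchmark's own definition; the name `is_nested` and the example values are hypothetical.

```python
from typing import Dict, Tuple


def is_nested(intervals_by_level: Dict[float, Tuple[float, float]]) -> bool:
    """Check that intervals widen (or stay equal) as the confidence level rises.

    Keys are confidence levels (e.g. 0.5, 0.8, 0.9); values are (lower, upper) bounds.
    """
    # Every interval must be well-formed before nesting is meaningful.
    for lower, upper in intervals_by_level.values():
        if lower > upper:
            return False

    levels = sorted(intervals_by_level)
    for narrower_level, wider_level in zip(levels, levels[1:]):
        lo_n, hi_n = intervals_by_level[narrower_level]
        lo_w, hi_w = intervals_by_level[wider_level]
        # The higher-confidence interval must contain the lower-confidence one.
        if lo_w > lo_n or hi_w < hi_n:
            return False
    return True


# Hypothetical forecasts for a single quantity at three confidence levels.
consistent = {0.5: (95.0, 105.0), 0.8: (90.0, 110.0), 0.9: (85.0, 120.0)}
inconsistent = {0.5: (95.0, 105.0), 0.9: (97.0, 103.0)}  # 90% interval is narrower

print(is_nested(consistent))
print(is_nested(inconsistent))
```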
